summaryrefslogtreecommitdiffstats
path: root/fs (follow)
Commit message (Collapse)AuthorAgeFilesLines
* Merge branch 'akpm' (patches from Andrew)Linus Torvalds2016-10-1216-159/+259
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge more updates from Andrew Morton: - a few block updates that fell in my lap - lib/ updates - checkpatch - autofs - ipc - a ton of misc other things * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (100 commits) mm: split gfp_mask and mapping flags into separate fields fs: use mapping_set_error instead of opencoded set_bit treewide: remove redundant #include <linux/kconfig.h> hung_task: allow hung_task_panic when hung_task_warnings is 0 kthread: add kerneldoc for kthread_create() kthread: better support freezable kthread workers kthread: allow to modify delayed kthread work kthread: allow to cancel kthread work kthread: initial support for delayed kthread work kthread: detect when a kthread work is used by more workers kthread: add kthread_destroy_worker() kthread: add kthread_create_worker*() kthread: allow to call __kthread_create_on_node() with va_list args kthread/smpboot: do not park in kthread_create_on_cpu() kthread: kthread worker API cleanup kthread: rename probe_kthread_data() to kthread_probe_data() scripts/tags.sh: enable code completion in VIM mm: kmemleak: avoid using __va() on addresses that don't have a lowmem mapping kdump, vmcoreinfo: report memory sections virtual addresses ipc/sem.c: add cond_resched in exit_sme ...
| * fs: use mapping_set_error instead of opencoded set_bitMichal Hocko2016-10-126-10/+8
| | | | | | | | | | | | | | | | | | | | | | The mapping_set_error() helper sets the correct AS_ flag for the mapping so there is no reason to open code it. Use the helper directly. [akpm@linux-foundation.org: be honest about conversion from -ENXIO to -EIO] Link: http://lkml.kernel.org/r/20160912111608.2588-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * treewide: remove redundant #include <linux/kconfig.h>Masahiro Yamada2016-10-121-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Kernel source files need not include <linux/kconfig.h> explicitly because the top Makefile forces to include it with: -include $(srctree)/include/linux/kconfig.h This commit removes explicit includes except the following: * arch/s390/include/asm/facilities_src.h * tools/testing/radix-tree/linux/kernel.h These two are used for host programs. Link: http://lkml.kernel.org/r/1473656164-11929-1-git-send-email-yamada.masahiro@socionext.com Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * pipe: cap initial pipe capacity according to pipe-max-size limitMichael Kerrisk (man-pages)2016-10-121-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is a patch that provides behavior that is more consistent, and probably less surprising to users. I consider the change optional, and welcome opinions about whether it should be applied. By default, pipes are created with a capacity of 64 kiB. However, /proc/sys/fs/pipe-max-size may be set smaller than this value. In this scenario, an unprivileged user could thus create a pipe whose initial capacity exceeds the limit. Therefore, it seems logical to cap the initial pipe capacity according to the value of pipe-max-size. The test program shown earlier in this patch series can be used to demonstrate the effect of the change brought about with this patch: # cat /proc/sys/fs/pipe-max-size 1048576 # sudo -u mtk ./test_F_SETPIPE_SZ 1 Initial pipe capacity: 65536 # echo 10000 > /proc/sys/fs/pipe-max-size # cat /proc/sys/fs/pipe-max-size 16384 # sudo -u mtk ./test_F_SETPIPE_SZ 1 Initial pipe capacity: 16384 # ./test_F_SETPIPE_SZ 1 Initial pipe capacity: 65536 The last two executions of 'test_F_SETPIPE_SZ' show that pipe-max-size caps the initial allocation for a new pipe for unprivileged users, but not for privileged users. Link: http://lkml.kernel.org/r/31dc7064-2a17-9c5b-1df1-4e3012ee992c@gmail.com Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com> Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com> Cc: Willy Tarreau <w@1wt.eu> Cc: <socketpair@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Jens Axboe <axboe@fb.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * pipe: make account_pipe_buffers() return a value, and use itMichael Kerrisk (man-pages)2016-10-121-18/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is an optional patch, to provide a small performance improvement. Alter account_pipe_buffers() so that it returns the new value in user->pipe_bufs. This means that we can refactor too_many_pipe_buffers_soft() and too_many_pipe_buffers_hard() to avoid the costs of repeated use of atomic_long_read() to get the value user->pipe_bufs. Link: http://lkml.kernel.org/r/93e5f193-1e5e-3e1f-3a20-eae79b7e1310@gmail.com Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com> Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com> Cc: Willy Tarreau <w@1wt.eu> Cc: <socketpair@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Jens Axboe <axboe@fb.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * pipe: fix limit checking in alloc_pipe_info()Michael Kerrisk (man-pages)2016-10-121-7/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The limit checking in alloc_pipe_info() (used by pipe(2) and when opening a FIFO) has the following problems: (1) When checking capacity required for the new pipe, the checks against the limit in /proc/sys/fs/pipe-user-pages-{soft,hard} are made against existing consumption, and exclude the memory required for the new pipe capacity. As a consequence: (1) the memory allocation throttling provided by the soft limit does not kick in quite as early as it should, and (2) the user can overrun the hard limit. (2) As currently implemented, accounting and checking against the limits is done as follows: (a) Test whether the user has exceeded the limit. (b) Make new pipe buffer allocation. (c) Account new allocation against the limits. This is racey. Multiple processes may pass point (a) simultaneously, and then allocate pipe buffers that are accounted for only in step (c). The race means that the user's pipe buffer allocation could be pushed over the limit (by an arbitrary amount, depending on how unlucky we were in the race). [Thanks to Vegard Nossum for spotting this point, which I had missed.] This patch addresses the above problems as follows: * Alter the checks against limits to include the memory required for the new pipe. * Re-order the accounting step so that it precedes the buffer allocation. If the accounting step determines that a limit has been reached, revert the accounting and cause the operation to fail. Link: http://lkml.kernel.org/r/8ff3e9f9-23f6-510c-644f-8e70cd1c0bd9@gmail.com Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com> Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com> Cc: Willy Tarreau <w@1wt.eu> Cc: <socketpair@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Jens Axboe <axboe@fb.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * pipe: simplify logic in alloc_pipe_info()Michael Kerrisk (man-pages)2016-10-121-22/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Replace an 'if' block that covers most of the code in this function with a 'goto'. This makes the code a little simpler to read, and also simplifies the next patch (fix limit checking in alloc_pipe_info()) Link: http://lkml.kernel.org/r/aef030c1-0257-98a9-4988-186efa48530c@gmail.com Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com> Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com> Cc: Willy Tarreau <w@1wt.eu> Cc: <socketpair@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Jens Axboe <axboe@fb.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * pipe: fix limit checking in pipe_set_size()Michael Kerrisk (man-pages)2016-10-121-10/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The limit checking in pipe_set_size() (used by fcntl(F_SETPIPE_SZ)) has the following problems: (1) When increasing the pipe capacity, the checks against the limits in /proc/sys/fs/pipe-user-pages-{soft,hard} are made against existing consumption, and exclude the memory required for the increased pipe capacity. The new increase in pipe capacity can then push the total memory used by the user for pipes (possibly far) over a limit. This can also trigger the problem described next. (2) The limit checks are performed even when the new pipe capacity is less than the existing pipe capacity. This can lead to problems if a user sets a large pipe capacity, and then the limits are lowered, with the result that the user will no longer be able to decrease the pipe capacity. (3) As currently implemented, accounting and checking against the limits is done as follows: (a) Test whether the user has exceeded the limit. (b) Make new pipe buffer allocation. (c) Account new allocation against the limits. This is racey. Multiple processes may pass point (a) simultaneously, and then allocate pipe buffers that are accounted for only in step (c). The race means that the user's pipe buffer allocation could be pushed over the limit (by an arbitrary amount, depending on how unlucky we were in the race). [Thanks to Vegard Nossum for spotting this point, which I had missed.] This patch addresses the above problems as follows: * Perform checks against the limits only when increasing a pipe's capacity; an unprivileged user can always decrease a pipe's capacity. * Alter the checks against limits to include the memory required for the new pipe capacity. * Re-order the accounting step so that it precedes the buffer allocation. If the accounting step determines that a limit has been reached, revert the accounting and cause the operation to fail. The program below can be used to demonstrate problems 1 and 2, and the effect of the fix. The program takes one or more command-line arguments. The first argument specifies the number of pipes that the program should create. The remaining arguments are, alternately, pipe capacities that should be set using fcntl(F_SETPIPE_SZ), and sleep intervals (in seconds) between the fcntl() operations. (The sleep intervals allow the possibility to change the limits between fcntl() operations.) Problem 1 ========= Using the test program on an unpatched kernel, we first set some limits: # echo 0 > /proc/sys/fs/pipe-user-pages-soft # echo 1000000000 > /proc/sys/fs/pipe-max-size # echo 10000 > /proc/sys/fs/pipe-user-pages-hard # 40.96 MB Then show that we can set a pipe with capacity (100MB) that is over the hard limit # sudo -u mtk ./test_F_SETPIPE_SZ 1 100000000 Initial pipe capacity: 65536 Loop 1: set pipe capacity to 100000000 bytes F_SETPIPE_SZ returned 134217728 Now set the capacity to 100MB twice. The second call fails (which is probably surprising to most users, since it seems like a no-op): # sudo -u mtk ./test_F_SETPIPE_SZ 1 100000000 0 100000000 Initial pipe capacity: 65536 Loop 1: set pipe capacity to 100000000 bytes F_SETPIPE_SZ returned 134217728 Loop 2: set pipe capacity to 100000000 bytes Loop 2, pipe 0: F_SETPIPE_SZ failed: fcntl: Operation not permitted With a patched kernel, setting a capacity over the limit fails at the first attempt: # echo 0 > /proc/sys/fs/pipe-user-pages-soft # echo 1000000000 > /proc/sys/fs/pipe-max-size # echo 10000 > /proc/sys/fs/pipe-user-pages-hard # sudo -u mtk ./test_F_SETPIPE_SZ 1 100000000 Initial pipe capacity: 65536 Loop 1: set pipe capacity to 100000000 bytes Loop 1, pipe 0: F_SETPIPE_SZ failed: fcntl: Operation not permitted There is a small chance that the change to fix this problem could break user-space, since there are cases where fcntl(F_SETPIPE_SZ) calls that previously succeeded might fail. However, the chances are small, since (a) the pipe-user-pages-{soft,hard} limits are new (in 4.5), and the default soft/hard limits are high/unlimited. Therefore, it seems warranted to make these limits operate more precisely (and behave more like what users probably expect). Problem 2 ========= Running the test program on an unpatched kernel, we first set some limits: # getconf PAGESIZE 4096 # echo 0 > /proc/sys/fs/pipe-user-pages-soft # echo 1000000000 > /proc/sys/fs/pipe-max-size # echo 10000 > /proc/sys/fs/pipe-user-pages-hard # 40.96 MB Now perform two fcntl(F_SETPIPE_SZ) operations on a single pipe, first setting a pipe capacity (10MB), sleeping for a few seconds, during which time the hard limit is lowered, and then set pipe capacity to a smaller amount (5MB): # sudo -u mtk ./test_F_SETPIPE_SZ 1 10000000 15 5000000 & [1] 748 # Initial pipe capacity: 65536 Loop 1: set pipe capacity to 10000000 bytes F_SETPIPE_SZ returned 16777216 Sleeping 15 seconds # echo 1000 > /proc/sys/fs/pipe-user-pages-hard # 4.096 MB # Loop 2: set pipe capacity to 5000000 bytes Loop 2, pipe 0: F_SETPIPE_SZ failed: fcntl: Operation not permitted In this case, the user should be able to lower the limit. With a kernel that has the patch below, the second fcntl() succeeds: # echo 0 > /proc/sys/fs/pipe-user-pages-soft # echo 1000000000 > /proc/sys/fs/pipe-max-size # echo 10000 > /proc/sys/fs/pipe-user-pages-hard # sudo -u mtk ./test_F_SETPIPE_SZ 1 10000000 15 5000000 & [1] 3215 # Initial pipe capacity: 65536 # Loop 1: set pipe capacity to 10000000 bytes F_SETPIPE_SZ returned 16777216 Sleeping 15 seconds # echo 1000 > /proc/sys/fs/pipe-user-pages-hard # Loop 2: set pipe capacity to 5000000 bytes F_SETPIPE_SZ returned 8388608 8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x--- /* test_F_SETPIPE_SZ.c (C) 2016, Michael Kerrisk; licensed under GNU GPL version 2 or later Test operation of fcntl(F_SETPIPE_SZ) for setting pipe capacity and interactions with limits defined by /proc/sys/fs/pipe-* files. */ #define _GNU_SOURCE #include <stdio.h> #include <stdlib.h> #include <fcntl.h> #include <unistd.h> int main(int argc, char *argv[]) { int (*pfd)[2]; int npipes; int pcap, rcap; int j, p, s, stime, loop; if (argc < 2) { fprintf(stderr, "Usage: %s num-pipes " "[pipe-capacity sleep-time]...\n", argv[0]); exit(EXIT_FAILURE); } npipes = atoi(argv[1]); pfd = calloc(npipes, sizeof (int [2])); if (pfd == NULL) { perror("calloc"); exit(EXIT_FAILURE); } for (j = 0; j < npipes; j++) { if (pipe(pfd[j]) == -1) { fprintf(stderr, "Loop %d: pipe() failed: ", j); perror("pipe"); exit(EXIT_FAILURE); } } printf("Initial pipe capacity: %d\n", fcntl(pfd[0][0], F_GETPIPE_SZ)); for (j = 2; j < argc; j += 2 ) { loop = j / 2; pcap = atoi(argv[j]); printf(" Loop %d: set pipe capacity to %d bytes\n", loop, pcap); for (p = 0; p < npipes; p++) { s = fcntl(pfd[p][0], F_SETPIPE_SZ, pcap); if (s == -1) { fprintf(stderr, " Loop %d, pipe %d: F_SETPIPE_SZ " "failed: ", loop, p); perror("fcntl"); exit(EXIT_FAILURE); } if (p == 0) { printf(" F_SETPIPE_SZ returned %d\n", s); rcap = s; } else { if (s != rcap) { fprintf(stderr, " Loop %d, pipe %d: F_SETPIPE_SZ " "unexpected return: %d\n", loop, p, s); exit(EXIT_FAILURE); } } stime = (j + 1 < argc) ? atoi(argv[j + 1]) : 0; if (stime > 0) { printf(" Sleeping %d seconds\n", stime); sleep(stime); } } } exit(EXIT_SUCCESS); } 8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x--- Patch history: v2 * Switch order of test in 'if' statement to avoid function call (to capability()) in normal path. [This is a fix to a preexisting wart in the code. Thanks to Willy Tarreau] * Perform (size > pipe_max_size) check before calling account_pipe_buffers(). [Thanks to Vegard Nossum] Quoting Vegard: The potential problem happens if the user passes a very large number which will overflow pipe->user->pipe_bufs. On 32-bit, sizeof(int) == sizeof(long), so if they pass arg = INT_MAX then round_pipe_size() returns INT_MAX. Although it's true that the accounting is done in terms of pages and not bytes, so you'd need on the order of (1 << 13) = 8192 processes hitting the limit at the same time in order to make it overflow, which seems a bit unlikely. (See https://lkml.org/lkml/2016/8/12/215 for another discussion on the limit checking) Link: http://lkml.kernel.org/r/1e464945-536b-2420-798b-e77b9c7e8593@gmail.com Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com> Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com> Cc: Willy Tarreau <w@1wt.eu> Cc: <socketpair@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Jens Axboe <axboe@fb.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * pipe: refactor argument for account_pipe_buffers()Michael Kerrisk (man-pages)2016-10-121-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is a preparatory patch for following work. account_pipe_buffers() performs accounting in the 'user_struct'. There is no need to pass a pointer to a 'pipe_inode_info' struct (which is then dereferenced to obtain a pointer to the 'user' field). Instead, pass a pointer directly to the 'user_struct'. This change is needed in preparation for a subsequent patch that the fixes the limit checking in alloc_pipe_info() (and the resulting code is a little more logical). Link: http://lkml.kernel.org/r/7277bf8c-a6fc-4a7d-659c-f5b145c981ab@gmail.com Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com> Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com> Cc: Willy Tarreau <w@1wt.eu> Cc: <socketpair@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Jens Axboe <axboe@fb.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * pipe: move limit checking logic into pipe_set_size()Michael Kerrisk (man-pages)2016-10-121-23/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is a preparatory patch for following work. Move the F_SETPIPE_SZ limit-checking logic from pipe_fcntl() into pipe_set_size(). This simplifies the code a little, and allows for reworking required in a later patch that fixes the limit checking in pipe_set_size() Link: http://lkml.kernel.org/r/3701b2c5-2c52-2c3e-226d-29b9deb29b50@gmail.com Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com> Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com> Cc: Willy Tarreau <w@1wt.eu> Cc: <socketpair@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Jens Axboe <axboe@fb.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * pipe: relocate round_pipe_size() above pipe_set_size()Michael Kerrisk (man-pages)2016-10-121-12/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "pipe: fix limit handling", v2. When changing a pipe's capacity with fcntl(F_SETPIPE_SZ), various limits defined by /proc/sys/fs/pipe-* files are checked to see if unprivileged users are exceeding limits on memory consumption. While documenting and testing the operation of these limits I noticed that, as currently implemented, these checks have a number of problems: (1) When increasing the pipe capacity, the checks against the limits in /proc/sys/fs/pipe-user-pages-{soft,hard} are made against existing consumption, and exclude the memory required for the increased pipe capacity. The new increase in pipe capacity can then push the total memory used by the user for pipes (possibly far) over a limit. This can also trigger the problem described next. (2) The limit checks are performed even when the new pipe capacity is less than the existing pipe capacity. This can lead to problems if a user sets a large pipe capacity, and then the limits are lowered, with the result that the user will no longer be able to decrease the pipe capacity. (3) As currently implemented, accounting and checking against the limits is done as follows: (a) Test whether the user has exceeded the limit. (b) Make new pipe buffer allocation. (c) Account new allocation against the limits. This is racey. Multiple processes may pass point (a) simultaneously, and then allocate pipe buffers that are accounted for only in step (c). The race means that the user's pipe buffer allocation could be pushed over the limit (by an arbitrary amount, depending on how unlucky we were in the race). [Thanks to Vegard Nossum for spotting this point, which I had missed.] This patch series addresses these three problems. This patch (of 8): This is a minor preparatory patch. After subsequent patches, round_pipe_size() will be called from pipe_set_size(), so place round_pipe_size() above pipe_set_size(). Link: http://lkml.kernel.org/r/91a91fdb-a959-ba7f-b551-b62477cc98a1@gmail.com Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com> Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com> Cc: Willy Tarreau <w@1wt.eu> Cc: <socketpair@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Jens Axboe <axboe@fb.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * autofs: refactor ioctl fn vector in iookup_dev_ioctl()Tomohiro Kusumi2016-10-121-33/+16
| | | | | | | | | | | | | | | | | | | | | | cmd part of this struct is the same as an index of itself within _ioctls[]. In fact this cmd is unused, so we can drop this part. Link: http://lkml.kernel.org/r/20160831033414.9910.66697.stgit@pluto.themaw.net Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * autofs: remove possibly misleading /* #define DEBUG */Tomohiro Kusumi2016-10-121-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Having this in autofs_i.h gives illusion that uncommenting this enables pr_debug(), but it doesn't enable all the pr_debug() in autofs because inclusion order matters. XFS has the same DEBUG macro in its core header fs/xfs/xfs.h, however XFS seems to have a rule to include this prior to other XFS headers as well as kernel headers. This is not the case with autofs, and DEBUG could be enabled via Makefile, so autofs should just get rid of this comment to make the code less confusing. It's a comment, so there is literally no functional difference. Link: http://lkml.kernel.org/r/20160831033409.9910.77067.stgit@pluto.themaw.net Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * autofs: fix print format for ioctl warning messageTomohiro Kusumi2016-10-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | All other warnings use "cmd(0x%08x)" and this is the only one with "cmd(%d)". (below comes from my userspace debug program, but not automount daemon) [ 1139.905676] autofs4:pid:1640:check_dev_ioctl_version: ioctl control interface version mismatch: kernel(1.0), user(0.0), cmd(-1072131215) Link: http://lkml.kernel.org/r/20160812024851.12352.75458.stgit@pluto.themaw.net Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Signed-off-by: Ian Kent <ikent@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * autofs: add autofs_dev_ioctl_version() for AUTOFS_DEV_IOCTL_VERSION_CMDIan Kent2016-10-121-8/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | No functional changes, based on the following justification. 1. Make the code more consistent using the ioctl vector _ioctls[], rather than assigning NULL only for this ioctl command. 2. Remove goto done; for better maintainability in the long run. 3. The existing code is based on the fact that validate_dev_ioctl() sets ioctl version for any command, but AUTOFS_DEV_IOCTL_VERSION_CMD should explicitly set it regardless of the default behavior. Link: http://lkml.kernel.org/r/20160812024846.12352.9885.stgit@pluto.themaw.net Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Signed-off-by: Ian Kent <ikent@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * autofs: fix dev ioctl number range checkIan Kent2016-10-122-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The count of miscellaneous device ioctls in fs/autofs4/autofs_i.h is wrong. The number of ioctls is the difference between AUTOFS_DEV_IOCTL_VERSION_CMD and AUTOFS_DEV_IOCTL_ISMOUNTPOINT_CMD (14) not the difference between AUTOFS_IOC_COUNT and 11 (21). [kusumi.tomohiro@gmail.com: fix typo that made the count macro negative] Link: http://lkml.kernel.org/r/20160831033420.9910.16809.stgit@pluto.themaw.net Link: http://lkml.kernel.org/r/20160812024841.12352.11975.stgit@pluto.themaw.net Signed-off-by: Ian Kent <raven@themaw.net> Cc: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * autofs: fix pr_debug() messageTomohiro Kusumi2016-10-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | This isn't a return value, so change the message to indicate the status is the result of may_umount(). (or locate pr_debug() after put_user() with the same message) Link: http://lkml.kernel.org/r/20160812024836.12352.74628.stgit@pluto.themaw.net Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Signed-off-by: Ian Kent <ikent@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * autofs: don't fail to free_dev_ioctl(param)Tomohiro Kusumi2016-10-121-1/+2
| | | | | | | | | | | | | | | | | | | | Returning -ENOTTY here fails to free dynamically allocated param. Link: http://lkml.kernel.org/r/20160812024815.12352.69153.stgit@pluto.themaw.net Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Signed-off-by: Ian Kent <ikent@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * autofs: remove obsolete sb fieldsTomohiro Kusumi2016-10-121-2/+0
| | | | | | | | | | | | | | | | | | | | | | These two were left from commit aa55ddf340c9 ("autofs4: remove unused ioctls") which removed unused ioctls. Link: http://lkml.kernel.org/r/20160812024810.12352.96377.stgit@pluto.themaw.net Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Signed-off-by: Ian Kent <ikent@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * autofs: use autofs4_free_ino() to kfree dentry dataTomohiro Kusumi2016-10-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | kfree dentry data allocated by autofs4_new_ino() with autofs4_free_ino() instead of raw kfree. (since we have the interface to free autofs_info*) This patch was modified to remove the need to set the dentry info field to NULL dew to a change in the previous patch. Link: http://lkml.kernel.org/r/20160812024805.12352.43650.stgit@pluto.themaw.net Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * autofs: remove ino free in autofs4_dir_symlink()Ian Kent2016-10-121-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The inode allocation failure case in autofs4_dir_symlink() frees the autofs dentry info of the dentry without setting ->d_fsdata to NULL. That could lead to a double free so just get rid of the free and leave it to ->d_release(). Link: http://lkml.kernel.org/r/20160812024759.12352.10653.stgit@pluto.themaw.net Signed-off-by: Ian Kent <raven@themaw.net> Cc: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * autofs: add WARN_ON(1) for non dir/link inode caseTomohiro Kusumi2016-10-121-1/+2
| | | | | | | | | | | | | | | | | | | | | | It's invalid if the given mode is neither dir nor link, so warn on else case. Link: http://lkml.kernel.org/r/20160812024754.12352.8536.stgit@pluto.themaw.net Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * autofs: fix autofs4_fill_super() error exit handlingIan Kent2016-10-121-3/+3
| | | | | | | | | | | | | | | | | | | | Somewhere along the line the error handling gotos have become incorrect. Link: http://lkml.kernel.org/r/20160812024749.12352.15100.stgit@pluto.themaw.net Signed-off-by: Ian Kent <raven@themaw.net> Cc: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * autofs: test autofs versions first on sb initializationTomohiro Kusumi2016-10-121-17/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch does what the below comment says. It could be and it's considered better to do this first before various functions get called during initialization. /* Couldn't this be tested earlier? */ Link: http://lkml.kernel.org/r/20160812024744.12352.43075.stgit@pluto.themaw.net Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * autofs: drop unnecessary extern in autofs_i.hTomohiro Kusumi2016-10-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | autofs4_kill_sb() doesn't need to be declared as extern, and no other functions in .h are explicitly declared as extern. Link: http://lkml.kernel.org/r/20160812024739.12352.99354.stgit@pluto.themaw.net Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * fs/select: add vmalloc fallback for select(2)Vlastimil Babka2016-10-121-3/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows with the number of fds passed. We had a customer report page allocation failures of order-4 for this allocation. This is a costly order, so it might easily fail, as the VM expects such allocation to have a lower-order fallback. Such trivial fallback is vmalloc(), as the memory doesn't have to be physically contiguous and the allocation is temporary for the duration of the syscall only. There were some concerns, whether this would have negative impact on the system by exposing vmalloc() to userspace. Although an excessive use of vmalloc can cause some system wide performance issues - TLB flushes etc. - a large order allocation is not for free either and an excessive reclaim/compaction can have a similar effect. Also note that the size is effectively limited by RLIMIT_NOFILE which defaults to 1024 on the systems I checked. That means the bitmaps will fit well within single page and thus the vmalloc() fallback could be only excercised for processes where root allows a higher limit. Note that the poll(2) syscall seems to use a linked list of order-0 pages, so it doesn't need this kind of fallback. [eric.dumazet@gmail.com: fix failure path logic] [akpm@linux-foundation.org: use proper type for size] Link: http://lkml.kernel.org/r/20160927084536.5923-1-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Laight <David.Laight@ACULAB.COM> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Jason Baron <jbaron@akamai.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * block: implement (some of) fallocate for block devicesDarrick J. Wong2016-10-122-1/+79
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After much discussion, it seems that the fallocate feature flag FALLOC_FL_ZERO_RANGE maps nicely to SCSI WRITE SAME; and the feature FALLOC_FL_PUNCH_HOLE maps nicely to the devices that have been whitelisted for zeroing SCSI UNMAP. Punch still requires that FALLOC_FL_KEEP_SIZE is set. A length that goes past the end of the device will be clamped to the device size if KEEP_SIZE is set; or will return -EINVAL if not. Both start and length must be aligned to the device's logical block size. Since the semantics of fallocate are fairly well established already, wire up the two pieces. The other fallocate variants (collapse range, insert range, and allocate blocks) are not supported. Link: http://lkml.kernel.org/r/147518379992.22791.8849838163218235007.stgit@birch.djwong.org Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Mike Snitzer <snitzer@redhat.com> # tweaked header Cc: Brian Foster <bfoster@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Hannes Reinecke <hare@suse.de> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * ocfs2: fix memory leak in dlm_migrate_request_handler()Guozhonghua2016-10-121-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the dlm_migrate_request_handler(), when `ret' is -EEXIST, the mle should be freed, otherwise the memory will be leaked. Link: http://lkml.kernel.org/r/71604351584F6A4EBAE558C676F37CA4A3D3522A@H3CMLB12-EX.srv.huawei-3com.com Signed-off-by: Guozhonghua <guozhonghua@h3c.com> Reviewed-by: Mark Fasheh <mfasheh@versity.com> Cc: Eric Ren <zren@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Merge tag 'libnvdimm-for-4.9' of ↵Linus Torvalds2016-10-111-0/+1
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm updates from Dan Williams: "Aside from the recently added pmem sub-division support these have been in -next for several releases with no reported issues. The sub- division support was included in next-20161010 with no reported issues. It passes all unit tests including new tests for all the new functionality below. Summary: - PMEM sub-division support: Allow a single PMEM region to be divided into multiple namespaces. Originally, ~2 years ago, it was thought that partitions of a /dev/pmemX block device could handle sub-allocations of persistent memory for different use cases. With the decision to not support DAX mappings of raw block-devices, and the genesis of device-dax, the need for having multiple pmem-namespace per region has grown. - Device-DAX unified inode: In support of dynamic-resizing of a device-dax instance the kernel arranges for all mappings of a device-dax node to share the same inode. This allows unmap / truncate / invalidation events to affect all instances of the device similar to the behavior of mmap on block devices. - Hardware error scrubbing reworks: The original address-range-scrub and badblocks tracking solution allowed clearing entries at the individual namespace level, but it failed to clear the internal list of media errors maintained at the bus level. The result was that the next scrub or namespace disable/re-enable event would restore the cleared badblocks, but now that is fixed. The v4.8 kernel introduced an auto-scrub-on-machine-check behavior to repopulate the badblocks list. Now, in v4.9, the auto-scrub behavior can be disabled and simply arrange for the error reported in the machine-check to be added to the list. - DIMM health-event notification support: ACPI 6.1 defines a notification event code that can be send to ACPI NVDIMM devices. A poll(2) capable file descriptor for these events can be obtained from the nmemX/nfit/flags sysfs-attribute of a libnvdimm memory device. - Miscellaneous fixes: NVDIMM-N probe error, device-dax build error, and a change to dedup the flush hint list to not flush the memory controller more than necessary" * tag 'libnvdimm-for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (39 commits) /dev/dax: fix Kconfig dependency build breakage dax: use correct dev_t value dax: convert devm_create_dax_dev to PTR_ERR libnvdimm, namespace: allow creation of multiple pmem-namespaces per region libnvdimm, namespace: lift single pmem limit in scan_labels() libnvdimm, namespace: filter out of range labels in scan_labels() libnvdimm, namespace: enable allocation of multiple pmem namespaces libnvdimm, namespace: update label implementation for multi-pmem libnvdimm, namespace: expand pmem device naming scheme for multi-pmem libnvdimm, region: update nd_region_available_dpa() for multi-pmem support libnvdimm, namespace: sort namespaces by dpa at init libnvdimm, namespace: allow multiple pmem-namespaces per region at scan time tools/testing/nvdimm: support for sub-dividing a pmem region libnvdimm, namespace: unify blk and pmem label scanning libnvdimm, namespace: refactor uuid_show() into a namespace_to_uuid() helper libnvdimm, label: convert label tracking to a linked list libnvdimm, region: move region-mapping input-paramters to nd_mapping_desc nvdimm: reduce duplicated wpq flushes libnvdimm: clear the internal poison_list when clearing badblocks pmem: reduce kmap_atomic sections to the memcpys only ...
| * \ Merge branch 'for-4.9/dax' into libnvdimm-for-nextDan Williams2016-10-081-0/+1
| |\ \
| | * | dax: define a unified inode/address_space for device-dax mappingsDan Williams2016-08-241-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In support of enabling resize / truncate of device-dax instances, define a pseudo-fs to provide a unified inode/address space for vm operations. Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* | | | Merge branch 'for-linus-4.9' of ↵Linus Torvalds2016-10-1143-1071/+1563
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs updates from Chris Mason: "This is a big variety of fixes and cleanups. Liu Bo continues to fixup fuzzer related problems, and some of Josef's cleanups are prep for his bigger extent buffer changes (slated for v4.10)" * 'for-linus-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (39 commits) Revert "btrfs: let btrfs_delete_unused_bgs() to clean relocated bgs" Btrfs: remove unnecessary btrfs_mark_buffer_dirty in split_leaf Btrfs: don't BUG() during drop snapshot btrfs: fix btrfs_no_printk stub helper Btrfs: memset to avoid stale content in btree leaf btrfs: parent_start initialization cleanup btrfs: Remove already completed TODO comment btrfs: Do not reassign count in btrfs_run_delayed_refs btrfs: fix a possible umount deadlock Btrfs: fix memory leak in do_walk_down btrfs: btrfs_debug should consume fs_info when DEBUG is not defined btrfs: convert send's verbose_printk to btrfs_debug btrfs: convert pr_* to btrfs_* where possible btrfs: convert printk(KERN_* to use pr_* calls btrfs: unsplit printed strings btrfs: clean the old superblocks before freeing the device Btrfs: kill BUG_ON in run_delayed_tree_ref Btrfs: don't leak reloc root nodes on error btrfs: squash lines for simple wrapper functions Btrfs: improve check_node to avoid reading corrupted nodes ...
| * | | | Revert "btrfs: let btrfs_delete_unused_bgs() to clean relocated bgs"Chris Mason2016-10-102-11/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 5d8eb6fe517583f9c6d5b94faf2254a0207a45c9. When we remove devices, we free the device structures. Delaying btfs_remove_chunk() ends up hitting a use-after-free on them. Signed-off-by: Chris Mason <clm@fb.com>
| * | | | Btrfs: remove unnecessary btrfs_mark_buffer_dirty in split_leafLiu Bo2016-09-261-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we're not able to get enough space through splitting leaf, we'd create a new sibling leaf instead, and it's possible that we return a zero-nritem sibling leaf and mark it dirty before it's in a consistent state. With CONFIG_BTRFS_FS_CHECK_INTEGRITY=y, the integrity check of check_leaf will report panic due to this zero-nritem non-root leaf. This removes the unnecessary btrfs_mark_buffer_dirty. Reported-by: Filipe Manana <fdmanana@gmail.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | Btrfs: don't BUG() during drop snapshotJosef Bacik2016-09-261-11/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Really there's lots of things that can go wrong here, kill all the BUG_ON()'s and replace the logic ones with ASSERT()'s and return EIO instead. Signed-off-by: Josef Bacik <jbacik@fb.com> [ switched to btrfs_err, errors go to common label ] Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: fix btrfs_no_printk stub helperArnd Bergmann2016-09-261-10/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The addition of btrfs_no_printk() caused a build failure when CONFIG_PRINTK is disabled: fs/btrfs/send.c: In function 'send_rename': fs/btrfs/ctree.h:3367:2: error: implicit declaration of function 'btrfs_no_printk' [-Werror=implicit-function-declaration] This moves the helper outside of that #ifdef so it is always defined, and changes the existing #ifdef to refer to that helper as well for consistency. Fixes: 47c57058ff2c ("btrfs: btrfs_debug should consume fs_info when DEBUG is not defined") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | Btrfs: memset to avoid stale content in btree leafLiu Bo2016-09-263-19/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is an additional patch to "Btrfs: memset to avoid stale content in btree node block". This uses memset to initialize the unused space in a leaf to avoid potential stale content, which may be incurred by pushing items between sibling leaves. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: parent_start initialization cleanupGoldwyn Rodrigues2016-09-261-15/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Code cleanup. parent_start is initialized multiple times when it is not necessary to do so. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: Remove already completed TODO commentGoldwyn Rodrigues2016-09-261-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fixes: 7cf5b97650f2 ("btrfs: qgroup: Cleanup old inaccurate facilities") Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: Do not reassign count in btrfs_run_delayed_refsGoldwyn Rodrigues2016-09-261-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Code cleanup. count is already (unsgined long)-1. That is the reason run_all was set. Do not reassign it (unsigned long)-1. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: fix a possible umount deadlockAnand Jain2016-09-261-6/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | btrfs_show_devname() is using the device_list_mutex, sometimes a call to blkdev_put() leads vfs calling into this func. So call blkdev_put() outside of device_list_mutex, as of now. [ 983.284212] ====================================================== [ 983.290401] [ INFO: possible circular locking dependency detected ] [ 983.296677] 4.8.0-rc5-ceph-00023-g1b39cec2 #1 Not tainted [ 983.302081] ------------------------------------------------------- [ 983.308357] umount/21720 is trying to acquire lock: [ 983.313243] (&bdev->bd_mutex){+.+.+.}, at: [<ffffffff9128ec51>] blkdev_put+0x31/0x150 [ 983.321264] [ 983.321264] but task is already holding lock: [ 983.327101] (&fs_devs->device_list_mutex){+.+...}, at: [<ffffffffc033d6f6>] __btrfs_close_devices+0x46/0x200 [btrfs] [ 983.337839] [ 983.337839] which lock already depends on the new lock. [ 983.337839] [ 983.346024] [ 983.346024] the existing dependency chain (in reverse order) is: [ 983.353512] -> #4 (&fs_devs->device_list_mutex){+.+...}: [ 983.359096] [<ffffffff910dfd0c>] lock_acquire+0x1bc/0x1f0 [ 983.365143] [<ffffffff91823125>] mutex_lock_nested+0x65/0x350 [ 983.371521] [<ffffffffc02d8116>] btrfs_show_devname+0x36/0x1f0 [btrfs] [ 983.378710] [<ffffffff9129523e>] show_vfsmnt+0x4e/0x150 [ 983.384593] [<ffffffff9126ffc7>] m_show+0x17/0x20 [ 983.389957] [<ffffffff91276405>] seq_read+0x2b5/0x3b0 [ 983.395669] [<ffffffff9124c808>] __vfs_read+0x28/0x100 [ 983.401464] [<ffffffff9124eb3b>] vfs_read+0xab/0x150 [ 983.407080] [<ffffffff9124ec32>] SyS_read+0x52/0xb0 [ 983.412609] [<ffffffff91825fc0>] entry_SYSCALL_64_fastpath+0x23/0xc1 [ 983.419617] -> #3 (namespace_sem){++++++}: [ 983.424024] [<ffffffff910dfd0c>] lock_acquire+0x1bc/0x1f0 [ 983.430074] [<ffffffff918239e9>] down_write+0x49/0x80 [ 983.435785] [<ffffffff91272457>] lock_mount+0x67/0x1c0 [ 983.441582] [<ffffffff91272ab2>] do_add_mount+0x32/0xf0 [ 983.447458] [<ffffffff9127363a>] finish_automount+0x5a/0xc0 [ 983.453682] [<ffffffff91259513>] follow_managed+0x1b3/0x2a0 [ 983.459912] [<ffffffff9125b750>] lookup_fast+0x300/0x350 [ 983.465875] [<ffffffff9125d6e7>] path_openat+0x3a7/0xaa0 [ 983.471846] [<ffffffff9125ef75>] do_filp_open+0x85/0xe0 [ 983.477731] [<ffffffff9124c41c>] do_sys_open+0x14c/0x1f0 [ 983.483702] [<ffffffff9124c4de>] SyS_open+0x1e/0x20 [ 983.489240] [<ffffffff91825fc0>] entry_SYSCALL_64_fastpath+0x23/0xc1 [ 983.496254] -> #2 (&sb->s_type->i_mutex_key#3){+.+.+.}: [ 983.501798] [<ffffffff910dfd0c>] lock_acquire+0x1bc/0x1f0 [ 983.507855] [<ffffffff918239e9>] down_write+0x49/0x80 [ 983.513558] [<ffffffff91366237>] start_creating+0x87/0x100 [ 983.519703] [<ffffffff91366647>] debugfs_create_dir+0x17/0x100 [ 983.526195] [<ffffffff911df153>] bdi_register+0x93/0x210 [ 983.532165] [<ffffffff911df313>] bdi_register_owner+0x43/0x70 [ 983.538570] [<ffffffff914080fb>] device_add_disk+0x1fb/0x450 [ 983.544888] [<ffffffff91580226>] loop_add+0x1e6/0x290 [ 983.550596] [<ffffffff91fec358>] loop_init+0x10b/0x14f [ 983.556394] [<ffffffff91002207>] do_one_initcall+0xa7/0x180 [ 983.562618] [<ffffffff91f932e0>] kernel_init_freeable+0x1cc/0x266 [ 983.569370] [<ffffffff918174be>] kernel_init+0xe/0x100 [ 983.575166] [<ffffffff9182620f>] ret_from_fork+0x1f/0x40 [ 983.581131] -> #1 (loop_index_mutex){+.+.+.}: [ 983.585801] [<ffffffff910dfd0c>] lock_acquire+0x1bc/0x1f0 [ 983.591858] [<ffffffff91823125>] mutex_lock_nested+0x65/0x350 [ 983.598256] [<ffffffff9157ed3f>] lo_open+0x1f/0x60 [ 983.603704] [<ffffffff9128eec3>] __blkdev_get+0x123/0x400 [ 983.609757] [<ffffffff9128f4ea>] blkdev_get+0x34a/0x350 [ 983.615639] [<ffffffff9128f554>] blkdev_open+0x64/0x80 [ 983.621428] [<ffffffff9124aff6>] do_dentry_open+0x1c6/0x2d0 [ 983.627651] [<ffffffff9124c029>] vfs_open+0x69/0x80 [ 983.633181] [<ffffffff9125db74>] path_openat+0x834/0xaa0 [ 983.639152] [<ffffffff9125ef75>] do_filp_open+0x85/0xe0 [ 983.645035] [<ffffffff9124c41c>] do_sys_open+0x14c/0x1f0 [ 983.650999] [<ffffffff9124c4de>] SyS_open+0x1e/0x20 [ 983.656535] [<ffffffff91825fc0>] entry_SYSCALL_64_fastpath+0x23/0xc1 [ 983.663541] -> #0 (&bdev->bd_mutex){+.+.+.}: [ 983.668107] [<ffffffff910def43>] __lock_acquire+0x1003/0x17b0 [ 983.674510] [<ffffffff910dfd0c>] lock_acquire+0x1bc/0x1f0 [ 983.680561] [<ffffffff91823125>] mutex_lock_nested+0x65/0x350 [ 983.686967] [<ffffffff9128ec51>] blkdev_put+0x31/0x150 [ 983.692761] [<ffffffffc033481f>] btrfs_close_bdev+0x4f/0x60 [btrfs] [ 983.699699] [<ffffffffc033d77b>] __btrfs_close_devices+0xcb/0x200 [btrfs] [ 983.707178] [<ffffffffc033d8db>] btrfs_close_devices+0x2b/0xa0 [btrfs] [ 983.714380] [<ffffffffc03081c5>] close_ctree+0x265/0x340 [btrfs] [ 983.721061] [<ffffffffc02d7959>] btrfs_put_super+0x19/0x20 [btrfs] [ 983.727908] [<ffffffff91250e2f>] generic_shutdown_super+0x6f/0x100 [ 983.734744] [<ffffffff91250f56>] kill_anon_super+0x16/0x30 [ 983.740888] [<ffffffffc02da97e>] btrfs_kill_super+0x1e/0x130 [btrfs] [ 983.747909] [<ffffffff91250fe9>] deactivate_locked_super+0x49/0x80 [ 983.754745] [<ffffffff912515fd>] deactivate_super+0x5d/0x70 [ 983.760977] [<ffffffff91270a1c>] cleanup_mnt+0x5c/0x80 [ 983.766773] [<ffffffff91270a92>] __cleanup_mnt+0x12/0x20 [ 983.772738] [<ffffffff910aa2fe>] task_work_run+0x7e/0xc0 [ 983.778708] [<ffffffff91081b5a>] exit_to_usermode_loop+0x7e/0xb4 [ 983.785373] [<ffffffff910039eb>] syscall_return_slowpath+0xbb/0xd0 [ 983.792212] [<ffffffff9182605c>] entry_SYSCALL_64_fastpath+0xbf/0xc1 [ 983.799225] [ 983.799225] other info that might help us debug this: [ 983.799225] [ 983.807291] Chain exists of: &bdev->bd_mutex --> namespace_sem --> &fs_devs->device_list_mutex [ 983.816521] Possible unsafe locking scenario: [ 983.816521] [ 983.822489] CPU0 CPU1 [ 983.827043] ---- ---- [ 983.831599] lock(&fs_devs->device_list_mutex); [ 983.836289] lock(namespace_sem); [ 983.842268] lock(&fs_devs->device_list_mutex); [ 983.849478] lock(&bdev->bd_mutex); [ 983.853127] [ 983.853127] *** DEADLOCK *** [ 983.853127] [ 983.859113] 3 locks held by umount/21720: [ 983.863145] #0: (&type->s_umount_key#35){++++..}, at: [<ffffffff912515f5>] deactivate_super+0x55/0x70 [ 983.872713] #1: (uuid_mutex){+.+.+.}, at: [<ffffffffc033d8d3>] btrfs_close_devices+0x23/0xa0 [btrfs] [ 983.882206] #2: (&fs_devs->device_list_mutex){+.+...}, at: [<ffffffffc033d6f6>] __btrfs_close_devices+0x46/0x200 [btrfs] [ 983.893422] [ 983.893422] stack backtrace: [ 983.897824] CPU: 6 PID: 21720 Comm: umount Not tainted 4.8.0-rc5-ceph-00023-g1b39cec2 #1 [ 983.905958] Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 1.0c 09/07/2015 [ 983.913492] 0000000000000000 ffff8c8a53c17a38 ffffffff91429521 ffffffff9260f4f0 [ 983.921018] ffffffff92642760 ffff8c8a53c17a88 ffffffff911b2b04 0000000000000050 [ 983.928542] ffffffff9237d620 ffff8c8a5294aee0 ffff8c8a5294aeb8 ffff8c8a5294aee0 [ 983.936072] Call Trace: [ 983.938545] [<ffffffff91429521>] dump_stack+0x85/0xc4 [ 983.943715] [<ffffffff911b2b04>] print_circular_bug+0x1fb/0x20c [ 983.949748] [<ffffffff910def43>] __lock_acquire+0x1003/0x17b0 [ 983.955613] [<ffffffff910dfd0c>] lock_acquire+0x1bc/0x1f0 [ 983.961123] [<ffffffff9128ec51>] ? blkdev_put+0x31/0x150 [ 983.966550] [<ffffffff91823125>] mutex_lock_nested+0x65/0x350 [ 983.972407] [<ffffffff9128ec51>] ? blkdev_put+0x31/0x150 [ 983.977832] [<ffffffff9128ec51>] blkdev_put+0x31/0x150 [ 983.983101] [<ffffffffc033481f>] btrfs_close_bdev+0x4f/0x60 [btrfs] [ 983.989500] [<ffffffffc033d77b>] __btrfs_close_devices+0xcb/0x200 [btrfs] [ 983.996415] [<ffffffffc033d8db>] btrfs_close_devices+0x2b/0xa0 [btrfs] [ 984.003068] [<ffffffffc03081c5>] close_ctree+0x265/0x340 [btrfs] [ 984.009189] [<ffffffff9126cc5e>] ? evict_inodes+0x15e/0x170 [ 984.014881] [<ffffffffc02d7959>] btrfs_put_super+0x19/0x20 [btrfs] [ 984.021176] [<ffffffff91250e2f>] generic_shutdown_super+0x6f/0x100 [ 984.027476] [<ffffffff91250f56>] kill_anon_super+0x16/0x30 [ 984.033082] [<ffffffffc02da97e>] btrfs_kill_super+0x1e/0x130 [btrfs] [ 984.039548] [<ffffffff91250fe9>] deactivate_locked_super+0x49/0x80 [ 984.045839] [<ffffffff912515fd>] deactivate_super+0x5d/0x70 [ 984.051525] [<ffffffff91270a1c>] cleanup_mnt+0x5c/0x80 [ 984.056774] [<ffffffff91270a92>] __cleanup_mnt+0x12/0x20 [ 984.062201] [<ffffffff910aa2fe>] task_work_run+0x7e/0xc0 [ 984.067625] [<ffffffff91081b5a>] exit_to_usermode_loop+0x7e/0xb4 [ 984.073747] [<ffffffff910039eb>] syscall_return_slowpath+0xbb/0xd0 [ 984.080038] [<ffffffff9182605c>] entry_SYSCALL_64_fastpath+0xbf/0xc1 Reported-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | Btrfs: fix memory leak in do_walk_downLiu Bo2016-09-261-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The extent buffer 'next' needs to be free'd conditionally. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: btrfs_debug should consume fs_info when DEBUG is not definedJeff Mahoney2016-09-261-4/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We can hit unused variable warnings when btrfs_debug and friends are just aliases for no_printk. This is due to the fs_info not getting consumed by the function call, which can happen if convenenience variables are used. This patch adds a new btrfs_no_printk static inline that consumes the convenience variable and does nothing else. It silences the unused variable warning and has no impact on the generated code: $ size fs/btrfs/extent_io.o* text data bss dec hex filename 44072 152 32 44256 ace0 fs/btrfs/extent_io.o.btrfs_no_printk 44072 152 32 44256 ace0 fs/btrfs/extent_io.o.no_printk Fixes: 27a0dd61a5 (Btrfs: make btrfs_debug match pr_debug handling related to DEBUG) Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: convert send's verbose_printk to btrfs_debugJeff Mahoney2016-09-261-27/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This was basically an open-coded, less flexible dynamic printk. We can just use btrfs_debug instead. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: convert pr_* to btrfs_* where possibleJeff Mahoney2016-09-2620-177/+231
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For many printks, we want to know which file system issued the message. This patch converts most pr_* calls to use the btrfs_* versions instead. In some cases, this means adding plumbing to allow call sites access to an fs_info pointer. fs/btrfs/check-integrity.c is left alone for another day. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: convert printk(KERN_* to use pr_* callsJeff Mahoney2016-09-2616-275/+205
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch converts printk(KERN_* style messages to use the pr_* versions. One side effect is that anything that was KERN_DEBUG is now automatically a dynamic debug message. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: unsplit printed stringsJeff Mahoney2016-09-2627-391/+324
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | CodingStyle chapter 2: "[...] never break user-visible strings such as printk messages, because that breaks the ability to grep for them." This patch unsplits user-visible strings. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: clean the old superblocks before freeing the deviceJeff Mahoney2016-09-261-27/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | btrfs_rm_device frees the block device but then re-opens it using the saved device name. A race exists between the close and the re-open that allows the block size to be changed. The result is getting stuck forever in the reclaim loop in __getblk_slow. This patch moves the superblock cleanup before closing the block device, which is also consistent with other callers. We also don't need a private copy of dev_name as the whole routine operates under the uuid_mutex. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | Btrfs: kill BUG_ON in run_delayed_tree_refLiu Bo2016-09-261-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In a corrupted btrfs image, we can come across this BUG_ON and get an unreponsive system, but if we return errors instead, its caller can handle everything gracefully by aborting the current transaction. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | Btrfs: don't leak reloc root nodes on errorJosef Bacik2016-09-261-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We don't track the reloc roots in any sort of normal way, so the only way the root/commit_root nodes get free'd is if the relocation finishes successfully and the reloc root is deleted. Fix this by free'ing them in free_reloc_roots. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>