summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* Merge remote-tracking branch 'rostedt/tip/perf/urgent-2' into ↵H. Peter Anvin2012-06-026-16/+154
|\ | | | | | | x86-urgent-for-linus
| * ftrace/x86: Do not change stacks in DEBUG when calling lockdepSteven Rostedt2012-06-011-3/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When both DYNAMIC_FTRACE and LOCKDEP are set, the TRACE_IRQS_ON/OFF will call into the lockdep code. The lockdep code can call lots of functions that may be traced by ftrace. When ftrace is updating its code and hits a breakpoint, the breakpoint handler will call into lockdep. If lockdep happens to call a function that also has a breakpoint attached, it will jump back into the breakpoint handler resetting the stack to the debug stack and corrupt the contents currently on that stack. The 'do_sym' call that calls do_int3() is protected by modifying the IST table to point to a different location if another breakpoint is hit. But the TRACE_IRQS_OFF/ON are outside that protection, and if a breakpoint is hit from those, the stack will get corrupted, and the kernel will crash: [ 1013.243754] BUG: unable to handle kernel NULL pointer dereference at 0000000000000002 [ 1013.272665] IP: [<ffff880145cc0000>] 0xffff880145cbffff [ 1013.285186] PGD 1401b2067 PUD 14324c067 PMD 0 [ 1013.298832] Oops: 0010 [#1] PREEMPT SMP [ 1013.310600] CPU 2 [ 1013.317904] Modules linked in: ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables crc32c_intel ghash_clmulni_intel microcode usb_debug serio_raw pcspkr iTCO_wdt i2c_i801 iTCO_vendor_support e1000e nfsd nfs_acl auth_rpcgss lockd sunrpc i915 video i2c_algo_bit drm_kms_helper drm i2c_core [last unloaded: scsi_wait_scan] [ 1013.401848] [ 1013.407399] Pid: 112, comm: kworker/2:1 Not tainted 3.4.0+ #30 [ 1013.437943] RIP: 8eb8:[<ffff88014630a000>] [<ffff88014630a000>] 0xffff880146309fff [ 1013.459871] RSP: ffffffff8165e919:ffff88014780f408 EFLAGS: 00010046 [ 1013.477909] RAX: 0000000000000001 RBX: ffffffff81104020 RCX: 0000000000000000 [ 1013.499458] RDX: ffff880148008ea8 RSI: ffffffff8131ef40 RDI: ffffffff82203b20 [ 1013.521612] RBP: ffffffff81005751 R08: 0000000000000000 R09: 0000000000000000 [ 1013.543121] R10: ffffffff82cdc318 R11: 0000000000000000 R12: ffff880145cc0000 [ 1013.564614] R13: ffff880148008eb8 R14: 0000000000000002 R15: ffff88014780cb40 [ 1013.586108] FS: 0000000000000000(0000) GS:ffff880148000000(0000) knlGS:0000000000000000 [ 1013.609458] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 1013.627420] CR2: 0000000000000002 CR3: 0000000141f10000 CR4: 00000000001407e0 [ 1013.649051] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1013.670724] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 1013.692376] Process kworker/2:1 (pid: 112, threadinfo ffff88013fe0e000, task ffff88014020a6a0) [ 1013.717028] Stack: [ 1013.724131] ffff88014780f570 ffff880145cc0000 0000400000004000 0000000000000000 [ 1013.745918] cccccccccccccccc ffff88014780cca8 ffffffff811072bb ffffffff81651627 [ 1013.767870] ffffffff8118f8a7 ffffffff811072bb ffffffff81f2b6c5 ffffffff81f11bdb [ 1013.790021] Call Trace: [ 1013.800701] Code: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a <e7> d7 64 81 ff ff ff ff 01 00 00 00 00 00 00 00 65 d9 64 81 ff [ 1013.861443] RIP [<ffff88014630a000>] 0xffff880146309fff [ 1013.884466] RSP <ffff88014780f408> [ 1013.901507] CR2: 0000000000000002 The solution was to reuse the NMI functions that change the IDT table to make the debug stack keep its current stack (in kernel mode) when hitting a breakpoint: call debug_stack_set_zero TRACE_IRQS_ON call debug_stack_reset If the TRACE_IRQS_ON happens to hit a breakpoint then it will keep the current stack and not crash the box. Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * x86: Allow nesting of the debug stack IDT settingSteven Rostedt2012-06-011-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the NMI handler runs, it checks if it preempted a debug handler and if that handler is using the debug stack. If it is, it changes the IDT table not to update the stack, otherwise it will reset the debug stack and corrupt the debug handler it preempted. Now that ftrace uses breakpoints to change functions from nops to callers, many more places may hit a breakpoint. Unfortunately this includes some of the calls that lockdep performs. Which causes issues with the debug stack. It too needs to change the debug stack before tracing (if called from the debug handler). Allow the debug_stack_set_zero() and debug_stack_reset() to be nested so that the debug handlers can take advantage of them too. [ Used this_cpu_*() over __get_cpu_var() as suggested by H. Peter Anvin ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * x86: Reset the debug_stack update counterSteven Rostedt2012-06-011-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When an NMI goes off and it sees that it preempted the debug stack, to keep the debug stack safe, it changes the IDT to point to one that does not modify the stack on breakpoint (to allow breakpoints in NMIs). But the variable that gets set to know to undo it on exit never gets cleared on exit. Thus every NMI will reset it on exit the first time it is done even if it does not need to be reset. [ Added H. Peter Anvin's suggestion to use this_cpu_read/write ] Cc: <stable@vger.kernel.org> # v3.3 Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * ftrace: Use breakpoint method to update ftrace callerSteven Rostedt2012-06-011-16/+72
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On boot up and module load, it is fine to modify the code directly, without the use of breakpoints. This is because boot up modification is done before SMP is initialized, thus the modification is serial, and module load is done before the module executes. But after that we must use a SMP safe method to modify running code. Otherwise, if we are running the function tracer and update its function (by starting off the stack tracer, or perf tracing) the change of the function called by the ftrace trampoline is done directly. If this is being executed on another CPU, that CPU may take a GPF and crash the kernel. The breakpoint method is used to change the nops at all the functions, but the change of the ftrace callback handler itself was still using a direct modification. If tracing was enabled and the function callback was changed then another CPU could fault if it was currently calling the original callback. This modification must use the breakpoint method too. Note, the direct method is still used for boot up and module load. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * ftrace: Synchronize variable setting with breakpointsSteven Rostedt2012-06-013-6/+42
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the function tracer starts modifying the code via breakpoints it sets a variable (modifying_ftrace_code) to inform the breakpoint handler to call the ftrace int3 code. But there's no synchronization between setting this code and the handler, thus it is possible for the handler to be called on another CPU before it sees the variable. This will cause a kernel crash as the int3 handler will not know what to do with it. I originally added smp_mb()'s to force the visibility of the variable but H. Peter Anvin suggested that I just make it atomic. [ Added comments as suggested by Peter Zijlstra ] Suggested-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* | x86, x32, ptrace: Remove PTRACE_ARCH_PRCTL for x32H.J. Lu2012-06-011-6/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When I added x32 ptrace to 3.4 kernel, I also include PTRACE_ARCH_PRCTL support for x32 GDB For ARCH_GET_FS/GS, it takes a pointer to int64. But at user level, ARCH_GET_FS/GS takes a pointer to int32. So I have to add x32 ptrace to glibc to handle it with a temporary int64 passed to kernel and copy it back to GDB as int32. Roland suggested that PTRACE_ARCH_PRCTL is obsolete and x32 GDB should use fs_base and gs_base fields of user_regs_struct instead. Accordingly, remove PTRACE_ARCH_PRCTL completely from the x32 code to avoid possible memory overrun when pointer to int32 is passed to kernel. Link: http://lkml.kernel.org/r/CAMe9rOpDzHfS7NH7m1vmD9QRw8SSj4Sc%2BaNOgcWm_WJME2eRsQ@mail.gmail.com Signed-off-by: H. Peter Anvin <hpa@zytor.com> Cc: <stable@vger.kernel.org> v3.4
* | x86, efi: Add EFI boot stub documentationMatt Fleming2012-06-012-0/+67
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since we can't expect every user to read the EFI boot stub code it seems prudent to have a couple of paragraphs explaining what it is and how it works. The "initrd=" option in particular is tricky because it only understands absolute EFI-style paths (backslashes as directory separators), and until now this hasn't been documented anywhere. This has tripped up a couple of users. Cc: Matthew Garrett <mjg@redhat.com> Cc: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Link: http://lkml.kernel.org/r/1331907517-3985-4-git-send-email-matt@console-pimps.org Signed-off-by: H. Peter Anvin <hpa@zytor.com>
* | x86, efi; Add EFI boot stub console supportMatt Fleming2012-06-012-16/+75
| | | | | | | | | | | | | | | | | | | | | | | | | | | | We need a way of printing useful messages to the user, for example when we fail to open an initrd file, instead of just hanging the machine without giving the user any indication of what went wrong. So sprinkle some error messages throughout the EFI boot stub code to make it easier for users to diagnose/report problems. Reported-by: Keshav P R <the.ridikulus.rat@gmail.com> Cc: Matthew Garrett <mjg@redhat.com> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Link: http://lkml.kernel.org/r/1331907517-3985-3-git-send-email-matt@console-pimps.org Signed-off-by: H. Peter Anvin <hpa@zytor.com>
* | x86, efi: Only close open files in error pathMatt Fleming2012-06-011-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The loop at the 'close_handles' label in handle_ramdisks() should be using 'i', which represents the number of initrd files that were successfully opened, not 'nr_initrds' which is the number of initrd= arguments passed on the command line. Currently, if we execute the loop to close all file handles and we failed to open any initrds we'll try to call the close function on a garbage pointer, causing the machine to hang. Cc: Matthew Garrett <mjg@redhat.com> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Link: http://lkml.kernel.org/r/1331907517-3985-2-git-send-email-matt@console-pimps.org Signed-off-by: H. Peter Anvin <hpa@zytor.com>
* | Merge branch 'for-linus' of ↵Linus Torvalds2012-06-0145-186/+220
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal Pull second pile of signal handling patches from Al Viro: "This one is just task_work_add() series + remaining prereqs for it. There probably will be another pull request from that tree this cycle - at least for helpers, to get them out of the way for per-arch fixes remaining in the tree." Fix trivial conflict in kernel/irq/manage.c: the merge of Andrew's pile had brought in commit 97fd75b7b8e0 ("kernel/irq/manage.c: use the pr_foo() infrastructure to prefix printks") which changed one of the pr_err() calls that this merge moves around. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: keys: kill task_struct->replacement_session_keyring keys: kill the dummy key_replace_session_keyring() keys: change keyctl_session_to_parent() to use task_work_add() genirq: reimplement exit_irq_thread() hook via task_work_add() task_work_add: generic process-context callbacks avr32: missed _TIF_NOTIFY_RESUME on one of do_notify_resume callers parisc: need to check NOTIFY_RESUME when exiting from syscall move key_repace_session_keyring() into tracehook_notify_resume() TIF_NOTIFY_RESUME is defined on all targets now
| * | keys: kill task_struct->replacement_session_keyringOleg Nesterov2012-05-242-11/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Kill the no longer used task_struct->replacement_session_keyring, update copy_creds() and exit_creds(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Howells <dhowells@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Alexander Gordeev <agordeev@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: David Smith <dsmith@redhat.com> Cc: "Frank Ch. Eigler" <fche@redhat.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | keys: kill the dummy key_replace_session_keyring()Oleg Nesterov2012-05-242-4/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After the previouse change key_replace_session_keyring() becomes a nop. Remove the dummy definition in key.h and update the callers in arch/*/kernel/signal.c. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Howells <dhowells@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Alexander Gordeev <agordeev@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: David Smith <dsmith@redhat.com> Cc: "Frank Ch. Eigler" <fche@redhat.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | keys: change keyctl_session_to_parent() to use task_work_add()Oleg Nesterov2012-05-244-45/+46
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Change keyctl_session_to_parent() to use task_work_add() and move key_replace_session_keyring() logic into task_work->func(). Note that we do task_work_cancel() before task_work_add() to ensure that only one work can be pending at any time. This is important, we must not allow user-space to abuse the parent's ->task_works list. The callback, replace_session_keyring(), checks PF_EXITING. I guess this is not really needed but looks better. As a side effect, this fixes the (unlikely) race. The callers of key_replace_session_keyring() and keyctl_session_to_parent() lack the necessary barriers, the parent can miss the request. Now we can remove task_struct->replacement_session_keyring and related code. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Howells <dhowells@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Alexander Gordeev <agordeev@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: David Smith <dsmith@redhat.com> Cc: "Frank Ch. Eigler" <fche@redhat.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | genirq: reimplement exit_irq_thread() hook via task_work_add()Oleg Nesterov2012-05-244-49/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | exit_irq_thread() and task->irq_thread are needed to handle the unexpected (and unlikely) exit of irq-thread. We can use task_work instead and make this all private to kernel/irq/manage.c, cleanup plus micro-optimization. 1. rename exit_irq_thread() to irq_thread_dtor(), make it static, and move it up before irq_thread(). 2. change irq_thread() to do task_work_add(irq_thread_dtor) at the start and task_work_cancel() before return. tracehook_notify_resume() can never play with kthreads, only do_exit()->exit_task_work() can call the callback and this is what we want. 3. remove task_struct->irq_thread and the special hook in do_exit(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: David Howells <dhowells@redhat.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Alexander Gordeev <agordeev@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: David Smith <dsmith@redhat.com> Cc: "Frank Ch. Eigler" <fche@redhat.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | task_work_add: generic process-context callbacksOleg Nesterov2012-05-247-2/+136
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Provide a simple mechanism that allows running code in the (nonatomic) context of the arbitrary task. The caller does task_work_add(task, task_work) and this task executes task_work->func() either from do_notify_resume() or from do_exit(). The callback can rely on PF_EXITING to detect the latter case. "struct task_work" can be embedded in another struct, still it has "void *data" to handle the most common/simple case. This allows us to kill the ->replacement_session_keyring hack, and potentially this can have more users. Performance-wise, this adds 2 "unlikely(!hlist_empty())" checks into tracehook_notify_resume() and do_exit(). But at the same time we can remove the "replacement_session_keyring != NULL" checks from arch/*/signal.c and exit_creds(). Note: task_work_add/task_work_run abuses ->pi_lock. This is only because this lock is already used by lookup_pi_state() to synchronize with do_exit() setting PF_EXITING. Fortunately the scope of this lock in task_work.c is really tiny, and the code is unlikely anyway. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Howells <dhowells@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Alexander Gordeev <agordeev@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: David Smith <dsmith@redhat.com> Cc: "Frank Ch. Eigler" <fche@redhat.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | avr32: missed _TIF_NOTIFY_RESUME on one of do_notify_resume callersAl Viro2012-05-241-1/+1
| | | | | | | | | | | | | | | | | | we need that not just on syscall returns but on irq ones as well... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | parisc: need to check NOTIFY_RESUME when exiting from syscallAl Viro2012-05-241-1/+1
| | | | | | | | | | | | | | | | | | ... not just on return from interrupt Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | move key_repace_session_keyring() into tracehook_notify_resume()Al Viro2012-05-2430-66/+6
| | | | | | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | TIF_NOTIFY_RESUME is defined on all targets nowAl Viro2012-05-242-12/+0
| | | | | | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | | Merge branch 'for-3.5-take-2' of git://linux-nfs.org/~bfields/linuxLinus Torvalds2012-06-0116-209/+322
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull nfsd update from Bruce Fields. * 'for-3.5-take-2' of git://linux-nfs.org/~bfields/linux: (23 commits) nfsd: trivial: use SEEK_SET instead of 0 in vfs_llseek SUNRPC: split upcall function to extract reusable parts nfsd: allocate id-to-name and name-to-id caches in per-net operations. nfsd: make name-to-id cache allocated per network namespace context nfsd: make id-to-name cache allocated per network namespace context nfsd: pass network context to idmap init/exit functions nfsd: allocate export and expkey caches in per-net operations. nfsd: make expkey cache allocated per network namespace context nfsd: make export cache allocated per network namespace context nfsd: pass pointer to export cache down to stack wherever possible. nfsd: pass network context to export caches init/shutdown routines Lockd: pass network namespace to creation and destruction routines NFSd: remove hard-coded dereferences to name-to-id and id-to-name caches nfsd: pass pointer to expkey cache down to stack wherever possible. nfsd: use hash table from cache detail in nfsd export seq ops nfsd: pass svc_export_cache pointer as private data to "exports" seq file ops nfsd: use exp_put() for svc_export_cache put nfsd: use cache detail pointer from svc_export structure on cache put nfsd: add link to owner cache detail to svc_export structure nfsd: use passed cache_detail pointer expkey_parse() ...
| * | | nfsd: trivial: use SEEK_SET instead of 0 in vfs_llseekJeff Layton2012-04-251-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | They're equivalent, but SEEK_SET is more informative... Cc: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | | SUNRPC: split upcall function to extract reusable partsSimo Sorce2012-04-251-36/+64
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is needed to share code between the current server upcall mechanism and the new gssproxy upcall mechanism introduced in a following patch. Signed-off-by: Simo Sorce <simo@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | | nfsd: allocate id-to-name and name-to-id caches in per-net operations.Stanislav Kinsbursky2012-04-121-7/+7
| | | | | | | | | | | | | | | | | | | | Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | | nfsd: make name-to-id cache allocated per network namespace contextStanislav Kinsbursky2012-04-122-10/+16
| | | | | | | | | | | | | | | | | | | | Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | | nfsd: make id-to-name cache allocated per network namespace contextStanislav Kinsbursky2012-04-122-10/+26
| | | | | | | | | | | | | | | | | | | | Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | | nfsd: pass network context to idmap init/exit functionsStanislav Kinsbursky2012-04-123-14/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | These functions will be called from per-net operations. Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | | nfsd: allocate export and expkey caches in per-net operations.Stanislav Kinsbursky2012-04-121-6/+21
| | | | | | | | | | | | | | | | | | | | Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | | nfsd: make expkey cache allocated per network namespace contextStanislav Kinsbursky2012-04-126-24/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch also changes svcauth_unix_purge() function: added network namespace as a parameter and thus loop over all networks was replaced by only one call for ip map cache purge. Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | | nfsd: make export cache allocated per network namespace contextStanislav Kinsbursky2012-04-125-19/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch also changes prototypes of nfsd_export_flush() and exp_rootfh(): network namespace parameter added. Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | | nfsd: pass pointer to export cache down to stack wherever possible.Stanislav Kinsbursky2012-04-121-15/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This cache will be per-net soon. And it's easier to get the pointer to desired per-net instance only once and then pass it down instead of discovering it in every place were required. Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | | nfsd: pass network context to export caches init/shutdown routinesStanislav Kinsbursky2012-04-123-15/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | These functions will be called from per-net operations. Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | | Lockd: pass network namespace to creation and destruction routinesStanislav Kinsbursky2012-04-114-14/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | v2: dereference of most probably already released nlm_host removed in nlmclnt_done() and reclaimer(). These routines are called from locks reclaimer() kernel thread. This thread works in "init_net" network context and currently relays on persence on lockd thread and it's per-net resources. Thus lockd_up() and lockd_down() can't relay on current network context. So let's pass corrent one into them. Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | | NFSd: remove hard-coded dereferences to name-to-id and id-to-name cachesStanislav Kinsbursky2012-04-111-26/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | These dereferences to global static caches are redundant. They also prevents converting these caches into per-net ones. So this patch is cleanup + precursor of patch set,a which will make them per-net. Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | | nfsd: pass pointer to expkey cache down to stack wherever possible.Stanislav Kinsbursky2012-04-111-16/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This cache will be per-net soon. And it's easier to get the pointer to desired per-net instance only once and then pass it down instead of discovering it in every place were required. Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | | nfsd: use hash table from cache detail in nfsd export seq opsStanislav Kinsbursky2012-04-111-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Hard-code is redundant and will prevent from making caches per net ns. Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | | nfsd: pass svc_export_cache pointer as private data to "exports" seq file opsStanislav Kinsbursky2012-04-112-8/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Global svc_export_cache cache is going to be replaced with per-net instance. So prepare the ground for it. Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | | nfsd: use exp_put() for svc_export_cache putStanislav Kinsbursky2012-04-112-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch replaces cache_put() call for svc_export_cache by exp_put() call. Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | | nfsd: use cache detail pointer from svc_export structure on cache putStanislav Kinsbursky2012-04-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Hard-coded pointer is redundant now and can be replaced. Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | | nfsd: add link to owner cache detail to svc_export structureStanislav Kinsbursky2012-04-112-5/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Without info about owner cache datail it won't be able to find out, which per-net cache detail have to be. Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | | nfsd: use passed cache_detail pointer expkey_parse()Stanislav Kinsbursky2012-04-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Using of hard-coded svc_expkey_cache pointer in expkey_parse() looks redundant. Moreover, global cache will be replaced with per-net instance soon. Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | | nfsd: don't use locks_in_grace to determine whether to call nfs4_grace_endJeff Layton2012-04-111-2/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It's possible that lockd or another lock manager might still be on the list after we call nfsd4_end_grace. If the laundromat thread runs again at that point, then we could end up calling nfsd4_end_grace more than once. That's not only inefficient, but calling nfsd4_recdir_purge_old more than once could be problematic. Fix this by adding a new global "grace_ended" flag and use that to determine whether we've already called nfsd4_grace_end. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * | | svcauth: remove unused defineSimo Sorce2012-04-111-1/+0
| | | | | | | | | | | | | | | | Signed-off-by: Simo Sorce <simo@redhat.com>
| * | | nfsd: trivial: remove unused variable from nfsd4_lockJeff Layton2012-04-111-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | ..."fp" is set but never used. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* | | | Merge branch 'akpm' (Andrew's patch-bomb)Linus Torvalds2012-06-01108-699/+10620
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge misc patches from Andrew Morton: - the "misc" tree - stuff from all over the map - checkpatch updates - fatfs - kmod changes - procfs - cpumask - UML - kexec - mqueue - rapidio - pidns - some checkpoint-restore feature work. Reluctantly. Most of it delayed a release. I'm still rather worried that we don't have a clear roadmap to completion for this work. * emailed from Andrew Morton <akpm@linux-foundation.org>: (78 patches) kconfig: update compression algorithm info c/r: prctl: add ability to set new mm_struct::exe_file c/r: prctl: extend PR_SET_MM to set up more mm_struct entries c/r: procfs: add arg_start/end, env_start/end and exit_code members to /proc/$pid/stat syscalls, x86: add __NR_kcmp syscall fs, proc: introduce /proc/<pid>/task/<tid>/children entry sysctl: make kernel.ns_last_pid control dependent on CHECKPOINT_RESTORE aio/vfs: cleanup of rw_copy_check_uvector() and compat_rw_copy_check_uvector() eventfd: change int to __u64 in eventfd_signal() fs/nls: add Apple NLS pidns: make killed children autoreap pidns: use task_active_pid_ns in do_notify_parent rapidio/tsi721: add DMA engine support rapidio: add DMA engine support for RIO data transfers ipc/mqueue: add rbtree node caching support tools/selftests: add mq_perf_tests ipc/mqueue: strengthen checks on mqueue creation ipc/mqueue: correct mq_attr_ok test ipc/mqueue: improve performance of send/recv selftests: add mq_open_tests ...
| * | | | kconfig: update compression algorithm infoRandy Dunlap2012-06-012-11/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There have been new compression algorithms added without updating nearby relevant descriptive text that refers to (a) the number of compression algorithms and (b) the most recent one. Fix these inconsistencies. Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Reported-by: <qasdfgtyuiop@gmail.com> Cc: Lasse Collin <lasse.collin@tukaani.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Markus Trippelsdorf <markus@trippelsdorf.de> Cc: Alain Knaff <alain@knaff.lu> Cc: Albin Tonnerre <albin.tonnerre@free-electrons.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | c/r: prctl: add ability to set new mm_struct::exe_fileCyrill Gorcunov2012-06-012-0/+57
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we do restore we would like to have a way to setup a former mm_struct::exe_file so that /proc/pid/exe would point to the original executable file a process had at checkpoint time. For this the PR_SET_MM_EXE_FILE code is introduced. This option takes a file descriptor which will be set as a source for new /proc/$pid/exe symlink. Note it allows to change /proc/$pid/exe if there are no VM_EXECUTABLE vmas present for current process, simply because this feature is a special to C/R and mm::num_exe_file_vmas become meaningless after that. To minimize the amount of transition the /proc/pid/exe symlink might have, this feature is implemented in one-shot manner. Thus once changed the symlink can't be changed again. This should help sysadmins to monitor the symlinks over all process running in a system. In particular one could make a snapshot of processes and ring alarm if there unexpected changes of /proc/pid/exe's in a system. Note -- this feature is available iif CONFIG_CHECKPOINT_RESTORE is set and the caller must have CAP_SYS_RESOURCE capability granted, otherwise the request to change symlink will be rejected. Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Kees Cook <keescook@chromium.org> Cc: Tejun Heo <tj@kernel.org> Cc: Matt Helsley <matthltc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | c/r: prctl: extend PR_SET_MM to set up more mm_struct entriesCyrill Gorcunov2012-06-012-51/+88
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During checkpoint we dump whole process memory to a file and the dump includes process stack memory. But among stack data itself, the stack carries additional parameters such as command line arguments, environment data and auxiliary vector. So when we do restore procedure and once we've restored stack data itself we need to setup mm_struct::arg_start/end, env_start/end, so restored process would be able to find command line arguments and environment data it had at checkpoint time. The same applies to auxiliary vector. For this reason additional PR_SET_MM_(ARG_START | ARG_END | ENV_START | ENV_END | AUXV) codes are introduced. Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: Kees Cook <keescook@chromium.org> Cc: Tejun Heo <tj@kernel.org> Cc: Andrew Vagin <avagin@openvz.org> Cc: Serge Hallyn <serge.hallyn@canonical.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Vasiliy Kulikov <segoon@openwall.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | c/r: procfs: add arg_start/end, env_start/end and exit_code members to ↵Cyrill Gorcunov2012-06-012-3/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | /proc/$pid/stat We would like to have an ability to restore command line arguments and program environment pointers but first we need to obtain them somehow. Thus we put these values into /proc/$pid/stat. The exit_code is needed to restore zombie tasks. Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: Kees Cook <keescook@chromium.org> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Serge Hallyn <serge.hallyn@canonical.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Andrew Vagin <avagin@openvz.org> Cc: Vasiliy Kulikov <segoon@openwall.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | syscalls, x86: add __NR_kcmp syscallCyrill Gorcunov2012-06-0110-1/+348
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While doing the checkpoint-restore in the user space one need to determine whether various kernel objects (like mm_struct-s of file_struct-s) are shared between tasks and restore this state. The 2nd step can be solved by using appropriate CLONE_ flags and the unshare syscall, while there's currently no ways for solving the 1st one. One of the ways for checking whether two tasks share e.g. mm_struct is to provide some mm_struct ID of a task to its proc file, but showing such info considered to be not that good for security reasons. Thus after some debates we end up in conclusion that using that named 'comparison' syscall might be the best candidate. So here is it -- __NR_kcmp. It takes up to 5 arguments - the pids of the two tasks (which characteristics should be compared), the comparison type and (in case of comparison of files) two file descriptors. Lookups for pids are done in the caller's PID namespace only. At moment only x86 is supported and tested. [akpm@linux-foundation.org: fix up selftests, warnings] [akpm@linux-foundation.org: include errno.h] [akpm@linux-foundation.org: tweak comment text] Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Andrey Vagin <avagin@openvz.org> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Glauber Costa <glommer@parallels.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Tejun Heo <tj@kernel.org> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Vasiliy Kulikov <segoon@openwall.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Valdis.Kletnieks@vt.edu Cc: Michal Marek <mmarek@suse.cz> Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>