summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* objtool: Add UACCESS validationPeter Zijlstra2019-04-0310-21/+226
| | | | | | | | | | | | | | | | | | | | | | | | | It is important that UACCESS regions are as small as possible; furthermore the UACCESS state is not scheduled, so doing anything that might directly call into the scheduler will cause random code to be ran with UACCESS enabled. Teach objtool too track UACCESS state and warn about any CALL made while UACCESS is enabled. This very much includes the __fentry__() and __preempt_schedule() calls. Note that exceptions _do_ save/restore the UACCESS state, and therefore they can drive preemption. This also means that all exception handlers must have an otherwise redundant UACCESS disable instruction; therefore ignore this warning for !STT_FUNC code (exception handlers are not normal functions). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
* objtool: Fix sibling call detectionPeter Zijlstra2019-04-031-31/+55
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It turned out that we failed to detect some sibling calls; specifically those without relocation records; like: $ ./objdump-func.sh defconfig-build/mm/kasan/generic.o __asan_loadN 0000 0000000000000840 <__asan_loadN>: 0000 840: 48 8b 0c 24 mov (%rsp),%rcx 0004 844: 31 d2 xor %edx,%edx 0006 846: e9 45 fe ff ff jmpq 690 <check_memory_region> So extend the cross-function jump to also consider those that are not between known (or newly detected) parent/child functions, as sibling-cals when they jump to the start of the function. The second part of that condition is to deal with random jumps to the middle of other function, as can be found in arch/x86/lib/copy_user_64.S for example. This then (with later patches applied) makes the above recognise the sibling call: mm/kasan/generic.o: warning: objtool: __asan_loadN()+0x6: call to check_memory_region() with UACCESS enabled Also make sure to set insn->call_dest for sibling calls so we can know who we're calling. This is useful information when printing validation warnings later. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
* objtool: Rewrite alt->skip_origPeter Zijlstra2019-04-031-6/+10
| | | | | | | | | | | | | | | | Really skip the original instruction flow, instead of letting it continue with NOPs. Since the alternative code flow already continues after the original instructions, only the alt-original is skipped. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
* objtool: Add --backtrace supportPeter Zijlstra2019-04-034-6/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For when you want to know the path that reached your fail state: $ ./objtool check --no-fp --backtrace arch/x86/lib/usercopy_64.o arch/x86/lib/usercopy_64.o: warning: objtool: .altinstr_replacement+0x3: UACCESS disable without MEMOPs: __clear_user() arch/x86/lib/usercopy_64.o: warning: objtool: __clear_user()+0x3a: (alt) arch/x86/lib/usercopy_64.o: warning: objtool: __clear_user()+0x2e: (branch) arch/x86/lib/usercopy_64.o: warning: objtool: __clear_user()+0x18: (branch) arch/x86/lib/usercopy_64.o: warning: objtool: .altinstr_replacement+0xffffffffffffffff: (branch) arch/x86/lib/usercopy_64.o: warning: objtool: __clear_user()+0x5: (alt) arch/x86/lib/usercopy_64.o: warning: objtool: __clear_user()+0x0: <=== (func) 0000000000000000 <__clear_user>: 0: e8 00 00 00 00 callq 5 <__clear_user+0x5> 1: R_X86_64_PLT32 __fentry__-0x4 5: 90 nop 6: 90 nop 7: 90 nop 8: 48 89 f0 mov %rsi,%rax b: 48 c1 ee 03 shr $0x3,%rsi f: 83 e0 07 and $0x7,%eax 12: 48 89 f1 mov %rsi,%rcx 15: 48 85 c9 test %rcx,%rcx 18: 74 0f je 29 <__clear_user+0x29> 1a: 48 c7 07 00 00 00 00 movq $0x0,(%rdi) 21: 48 83 c7 08 add $0x8,%rdi 25: ff c9 dec %ecx 27: 75 f1 jne 1a <__clear_user+0x1a> 29: 48 89 c1 mov %rax,%rcx 2c: 85 c9 test %ecx,%ecx 2e: 74 0a je 3a <__clear_user+0x3a> 30: c6 07 00 movb $0x0,(%rdi) 33: 48 ff c7 inc %rdi 36: ff c9 dec %ecx 38: 75 f6 jne 30 <__clear_user+0x30> 3a: 90 nop 3b: 90 nop 3c: 90 nop 3d: 48 89 c8 mov %rcx,%rax 40: c3 retq Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
* objtool: Rewrite add_ignores()Peter Zijlstra2019-04-032-32/+20
| | | | | | | | | | | | | The whole add_ignores() thing was wildly weird; rewrite it according to 'modern' ways. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
* objtool: Handle function aliasesPeter Zijlstra2019-04-032-5/+12
| | | | | | | | | | | | | | | | | Function aliases result in different symbols for the same set of instructions; track a canonical symbol so there is a unique point of access. This again prepares the way for function attributes. And in particular the need for aliases comes from how KASAN uses them. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
* objtool: Set insn->func for alternativesPeter Zijlstra2019-04-031-0/+1
| | | | | | | | | | | | | | | | In preparation of function attributes, we need each instruction to have a valid link back to its function. Therefore make sure we set the function association for alternative instruction sequences; they are, after all, still part of the function. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
* x86/uaccess, kcov: Disable stack protectorPeter Zijlstra2019-04-031-0/+1
| | | | | | | | | | | | | | | | | | New tooling noticed this mishap: kernel/kcov.o: warning: objtool: write_comp_data()+0x138: call to __stack_chk_fail() with UACCESS enabled kernel/kcov.o: warning: objtool: __sanitizer_cov_trace_pc()+0xd9: call to __stack_chk_fail() with UACCESS enabled All the other instrumentation (KASAN,UBSAN) also have stack protector disabled. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
* x86/uaccess, ftrace: Fix ftrace_likely_update() vs. SMAPPeter Zijlstra2019-04-031-0/+4
| | | | | | | | | | | | | | | | | For CONFIG_TRACE_BRANCH_PROFILING=y the likely/unlikely things get overloaded and generate callouts to this code, and thus also when AC=1. Make it safe. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
* x86/uaccess, ubsan: Fix UBSAN vs. SMAPPeter Zijlstra2019-04-032-0/+5
| | | | | | | | | | | | | | | | | UBSAN can insert extra code in random locations; including AC=1 sections. Typically this code is not safe and needs wrapping. So far, only __ubsan_handle_type_mismatch* have been observed in AC=1 sections and therefore only those are annotated. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
* x86/uaccess, kasan: Fix KASAN vs SMAPPeter Zijlstra2019-04-033-2/+14
| | | | | | | | | | | | | | | | | | | | | | | KASAN inserts extra code for every LOAD/STORE emitted by te compiler. Much of this code is simple and safe to run with AC=1, however the kasan_report() function, called on error, is most certainly not safe to call with AC=1. Therefore wrap kasan_report() in user_access_{save,restore}; which for x86 SMAP, saves/restores EFLAGS and clears AC before calling the real function. Also ensure all the functions are without __fentry__ hook. The function tracer is also not safe. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
* x86/smap: Ditch __stringify()Peter Zijlstra2019-04-031-10/+9
| | | | | | | | | | | | | Linus noticed all users of __ASM_STAC/__ASM_CLAC are with __stringify(). Just make them a string. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
* x86/uaccess: Introduce user_access_{save,restore}()Peter Zijlstra2019-04-033-0/+25
| | | | | | | | | | | | | Introduce common helpers for when we need to safely suspend a uaccess section; for instance to generate a {KA,UB}SAN report. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
* x86/uaccess, signal: Fix AC=1 bloatPeter Zijlstra2019-04-031-12/+17
| | | | | | | | | | | | | | | | | | | | Occasionally GCC is less agressive with inlining and the following is observed: arch/x86/kernel/signal.o: warning: objtool: restore_sigcontext()+0x3cc: call to force_valid_ss.isra.5() with UACCESS enabled arch/x86/kernel/signal.o: warning: objtool: do_signal()+0x384: call to frame_uc_flags.isra.0() with UACCESS enabled Cure this by moving this code out of the AC=1 region, since it really isn't needed for the user access. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
* x86/uaccess: Always inline user_access_begin()Peter Zijlstra2019-04-031-1/+1
| | | | | | | | | | | | | If GCC out-of-lines it, the STAC and CLAC are in different fuctions and objtool gets upset. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
* x86/uaccess, xen: Suppress SMAP warningsPeter Zijlstra2019-04-031-4/+20
| | | | | | | | | | | | | | | | | drivers/xen/privcmd.o: warning: objtool: privcmd_ioctl()+0x1414: call to hypercall_page() with UACCESS enabled Some Xen hypercalls allow parameter buffers in user land, so they need to set AC=1. Avoid the warning for those cases. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juergen Gross <jgross@suse.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: andrew.cooper3@citrix.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* x86/nospec, objtool: Introduce ANNOTATE_IGNORE_ALTERNATIVEPeter Zijlstra2019-04-034-23/+34
| | | | | | | | | | | | | To facillitate other usage of ignoring alternatives; rename ANNOTATE_NOSPEC_IGNORE to ANNOTATE_IGNORE_ALTERNATIVE. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
* x86/uaccess: Fix up the fixupPeter Zijlstra2019-04-031-1/+2
| | | | | | | | | | | | | | | | | | | | New tooling got confused about this: arch/x86/lib/memcpy_64.o: warning: objtool: .fixup+0x7: return with UACCESS enabled While the code isn't wrong, it is tedious (if at all possible) to figure out what function a particular chunk of .fixup belongs to. This then confuses the objtool uaccess validation. Instead of returning directly from the .fixup, jump back into the right function. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
* x86/uaccess: Move copy_user_handle_tail() into asmPeter Zijlstra2019-04-034-47/+48
| | | | | | | | | | | | | | | By writing the function in asm we avoid cross object code flow and objtool no longer gets confused about a 'stray' CLAC. Also; the asm version is actually _simpler_. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
* i915, uaccess: Fix redundant CLACPeter Zijlstra2019-04-031-2/+4
| | | | | | | | | | | | | | | | | | New tooling noticed this: drivers/gpu/drm/i915/i915_gem_execbuffer.o: warning: objtool: .altinstr_replacement+0x3c: redundant UACCESS disable drivers/gpu/drm/i915/i915_gem_execbuffer.o: warning: objtool: .altinstr_replacement+0x66: redundant UACCESS disable You don't need user_access_end() if user_access_begin() fails. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
* x86/ia32: Fix ia32_restore_sigcontext() AC leakPeter Zijlstra2019-04-031-12/+17
| | | | | | | | | | | | | Objtool spotted that we call native_load_gs_index() with AC set. Re-arrange the code to avoid that. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
* tracing: Improve "if" macro code generationJosh Poimboeuf2019-04-031-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With CONFIG_PROFILE_ALL_BRANCHES=y, the "if" macro converts the conditional to an array index. This can cause GCC to create horrible code. When there are nested ifs, the generated code uses register values to encode branching decisions. Make it easier for GCC to optimize by keeping the conditional as a conditional rather than converting it to an integer. This shrinks the generated code quite a bit, and also makes the code sane enough for objtool to understand. Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: brgerst@gmail.com Cc: catalin.marinas@arm.com Cc: dvlasenk@redhat.com Cc: dvyukov@google.com Cc: hpa@zytor.com Cc: james.morse@arm.com Cc: julien.thierry@arm.com Cc: luto@amacapital.net Cc: luto@kernel.org Cc: rostedt@goodmis.org Cc: valentin.schneider@arm.com Cc: will.deacon@arm.com Link: https://lkml.kernel.org/r/20190307174802.46fmpysxyo35hh43@treble Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched/x86: Save [ER]FLAGS on context switchPeter Zijlstra2019-04-035-0/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Effectively reverts commit: 2c7577a75837 ("sched/x86_64: Don't save flags on context switch") Specifically because SMAP uses FLAGS.AC which invalidates the claim that the kernel has clean flags. In particular; while preemption from interrupt return is fine (the IRET frame on the exception stack contains FLAGS) it breaks any code that does synchonous scheduling, including preempt_enable(). This has become a significant issue ever since commit: 5b24a7a2aa20 ("Add 'unsafe' user access functions for batched accesses") provided for means of having 'normal' C code between STAC / CLAC, exposing the FLAGS.AC state. So far this hasn't led to trouble, however fix it before it comes apart. Reported-by: Julien Thierry <julien.thierry@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@kernel.org Fixes: 5b24a7a2aa20 ("Add 'unsafe' user access functions for batched accesses") Signed-off-by: Ingo Molnar <mingo@kernel.org>
* Merge branch 'work.aio' of ↵Linus Torvalds2019-04-011-188/+150
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull aio race fixes and cleanups from Al Viro. The aio code had more issues with error handling and races with the aio completing at just the right (wrong) time along with freeing the file descriptor when another thread closes the file. Just a couple of these commits are the actual fixes: the others are cleanups to either make the fixes simpler, or to make the code legible and understandable enough that we hope there's no more fundamental races hiding. * 'work.aio' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: aio: move sanity checks and request allocation to io_submit_one() deal with get_reqs_available() in aio_get_req() itself aio: move dropping ->ki_eventfd into iocb_destroy() make aio_read()/aio_write() return int Fix aio_poll() races aio: store event at final iocb_put() aio: keep io_event in aio_kiocb aio: fold lookup_kiocb() into its sole caller pin iocb through aio.
| * aio: move sanity checks and request allocation to io_submit_one()Al Viro2019-03-181-66/+53
| | | | | | | | | | | | makes for somewhat cleaner control flow in __io_submit_one() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * deal with get_reqs_available() in aio_get_req() itselfAl Viro2019-03-181-6/+6
| | | | | | | | | | | | | | simplifies the caller Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * aio: move dropping ->ki_eventfd into iocb_destroy()Al Viro2019-03-181-9/+8
| | | | | | | | | | | | no reason to duplicate that... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * make aio_read()/aio_write() return intAl Viro2019-03-181-6/+6
| | | | | | | | | | | | | | | | that ssize_t is a rudiment of earlier calling conventions; it's been used only to pass 0 and -E... since last autumn. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * Fix aio_poll() racesAl Viro2019-03-181-50/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | aio_poll() has to cope with several unpleasant problems: * requests that might stay around indefinitely need to be made visible for io_cancel(2); that must not be done to a request already completed, though. * in cases when ->poll() has placed us on a waitqueue, wakeup might have happened (and request completed) before ->poll() returns. * worse, in some early wakeup cases request might end up re-added into the queue later - we can't treat "woken up and currently not in the queue" as "it's not going to stick around indefinitely" * ... moreover, ->poll() might have decided not to put it on any queues to start with, and that needs to be distinguished from the previous case * ->poll() might have tried to put us on more than one queue. Only the first will succeed for aio poll, so we might end up missing wakeups. OTOH, we might very well notice that only after the wakeup hits and request gets completed (all before ->poll() gets around to the second poll_wait()). In that case it's too late to decide that we have an error. req->woken was an attempt to deal with that. Unfortunately, it was broken. What we need to keep track of is not that wakeup has happened - the thing might come back after that. It's that async reference is already gone and won't come back, so we can't (and needn't) put the request on the list of cancellables. The easiest case is "request hadn't been put on any waitqueues"; we can tell by seeing NULL apt.head, and in that case there won't be anything async. We should either complete the request ourselves (if vfs_poll() reports anything of interest) or return an error. In all other cases we get exclusion with wakeups by grabbing the queue lock. If request is currently on queue and we have something interesting from vfs_poll(), we can steal it and complete the request ourselves. If it's on queue and vfs_poll() has not reported anything interesting, we either put it on the cancellable list, or, if we know that it hadn't been put on all queues ->poll() wanted it on, we steal it and return an error. If it's _not_ on queue, it's either been already dealt with (in which case we do nothing), or there's aio_poll_complete_work() about to be executed. In that case we either put it on the cancellable list, or, if we know it hadn't been put on all queues ->poll() wanted it on, simulate what cancel would've done. It's a lot more convoluted than I'd like it to be. Single-consumer APIs suck, and unfortunately aio is not an exception... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * aio: store event at final iocb_put()Al Viro2019-03-181-16/+17
| | | | | | | | | | | | | | | | Instead of having aio_complete() set ->ki_res.{res,res2}, do that explicitly in its callers, drop the reference (as aio_complete() used to do) and delay the rest until the final iocb_put(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * aio: keep io_event in aio_kiocbAl Viro2019-03-181-18/+13
| | | | | | | | | | | | | | We want to separate forming the resulting io_event from putting it into the ring buffer. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * aio: fold lookup_kiocb() into its sole callerAl Viro2019-03-181-22/+7
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * pin iocb through aio.Linus Torvalds2019-03-181-16/+21
| | | | | | | | | | | | | | | | | | aio_poll() is not the only case that needs file pinned; worse, while aio_read()/aio_write() can live without pinning iocb itself, the proof is rather brittle and can easily break on later changes. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds2019-04-014-13/+14
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull symlink fixes from Al Viro: "The ceph fix is already in mainline, Daniel's bpf fix is in bpf tree (1da6c4d9140c "bpf: fix use after free in bpf_evict_inode"), the rest is in here" * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: debugfs: fix use-after-free on symlink traversal ubifs: fix use-after-free on symlink traversal jffs2: fix use-after-free on symlink traversal
| * | debugfs: fix use-after-free on symlink traversalAl Viro2019-04-011-4/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | symlink body shouldn't be freed without an RCU delay. Switch debugfs to ->destroy_inode() and use of call_rcu(); free both the inode and symlink body in the callback. Similar to solution for bpf, only here it's even more obvious that ->evict_inode() can be dropped. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | ubifs: fix use-after-free on symlink traversalAl Viro2019-04-011-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | free the symlink body after the same RCU delay we have for freeing the struct inode itself, so that traversal during RCU pathwalk wouldn't step into freed memory. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | jffs2: fix use-after-free on symlink traversalAl Viro2019-04-012-6/+4
| |/ | | | | | | | | | | | | | | free the symlink body after the same RCU delay we have for freeing the struct inode itself, so that traversal during RCU pathwalk wouldn't step into freed memory. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | Linux 5.1-rc3v5.1-rc3Linus Torvalds2019-03-311-1/+1
| |
* | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2019-03-3160-201/+409
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull KVM fixes from Paolo Bonzini: "A collection of x86 and ARM bugfixes, and some improvements to documentation. On top of this, a cleanup of kvm_para.h headers, which were exported by some architectures even though they not support KVM at all. This is responsible for all the Kbuild changes in the diffstat" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (28 commits) Documentation: kvm: clarify KVM_SET_USER_MEMORY_REGION KVM: doc: Document the life cycle of a VM and its resources KVM: selftests: complete IO before migrating guest state KVM: selftests: disable stack protector for all KVM tests KVM: selftests: explicitly disable PIE for tests KVM: selftests: assert on exit reason in CR4/cpuid sync test KVM: x86: update %rip after emulating IO x86/kvm/hyper-v: avoid spurious pending stimer on vCPU init kvm/x86: Move MSR_IA32_ARCH_CAPABILITIES to array emulated_msrs KVM: x86: Emulate MSR_IA32_ARCH_CAPABILITIES on AMD hosts kvm: don't redefine flags as something else kvm: mmu: Used range based flushing in slot_handle_level_range KVM: export <linux/kvm_para.h> and <asm/kvm_para.h> iif KVM is supported KVM: x86: remove check on nr_mmu_pages in kvm_arch_commit_memory_region() kvm: nVMX: Add a vmentry check for HOST_SYSENTER_ESP and HOST_SYSENTER_EIP fields KVM: SVM: Workaround errata#1096 (insn_len maybe zero on SMAP violation) KVM: Reject device ioctls from processes other than the VM's creator KVM: doc: Fix incorrect word ordering regarding supported use of APIs KVM: x86: fix handling of role.cr4_pae and rename it to 'gpte_size' KVM: nVMX: Do not inherit quadrant and invalid for the root shadow EPT ...
| * \ Merge tag 'kvmarm-fixes-for-5.1' of ↵Paolo Bonzini2019-03-289-75/+133
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-master KVM/ARM fixes for 5.1 - Fix THP handling in the presence of pre-existing PTEs - Honor request for PTE mappings even when THPs are available - GICv4 performance improvement - Take the srcu lock when writing to guest-controlled ITS data structures - Reset the virtual PMU in preemptible context - Various cleanups
| | * | KVM: arm/arm64: Comments cleanup in mmu.cZenghui Yu2019-03-281-14/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some comments in virt/kvm/arm/mmu.c are outdated. Update them to reflect the current state of the code. Signed-off-by: Zenghui Yu <yuzenghui@huawei.com> Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com> [maz: commit message tidy-up] Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * | KVM: arm/arm64: vgic-its: Make attribute accessors staticYueHaibing2019-03-201-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix sparse warnings: arch/arm64/kvm/../../../virt/kvm/arm/vgic/vgic-its.c:1732:5: warning: symbol 'vgic_its_has_attr_regs' was not declared. Should it be static? arch/arm64/kvm/../../../virt/kvm/arm/vgic/vgic-its.c:1753:5: warning: symbol 'vgic_its_attr_regs_access' was not declared. Should it be static? Signed-off-by: YueHaibing <yuehaibing@huawei.com> [maz: fixed subject] Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * | KVM: arm/arm64: Fix handling of stage2 huge mappingsSuzuki K Poulose2019-03-202-16/+45
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We rely on the mmu_notifier call backs to handle the split/merge of huge pages and thus we are guaranteed that, while creating a block mapping, either the entire block is unmapped at stage2 or it is missing permission. However, we miss a case where the block mapping is split for dirty logging case and then could later be made block mapping, if we cancel the dirty logging. This not only creates inconsistent TLB entries for the pages in the the block, but also leakes the table pages for PMD level. Handle this corner case for the huge mappings at stage2 by unmapping the non-huge mapping for the block. This could potentially release the upper level table. So we need to restart the table walk once we unmap the range. Fixes : ad361f093c1e31d ("KVM: ARM: Support hugetlbfs backed huge pages") Reported-by: Zheng Xiang <zhengxiang9@huawei.com> Cc: Zheng Xiang <zhengxiang9@huawei.com> Cc: Zenghui Yu <yuzenghui@huawei.com> Cc: Christoffer Dall <christoffer.dall@arm.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * | KVM: arm/arm64: Enforce PTE mappings at stage2 when neededSuzuki K Poulose2019-03-191-22/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 6794ad5443a2118 ("KVM: arm/arm64: Fix unintended stage 2 PMD mappings") made the checks to skip huge mappings, stricter. However it introduced a bug where we still use huge mappings, ignoring the flag to use PTE mappings, by not reseting the vma_pagesize to PAGE_SIZE. Also, the checks do not cover the PUD huge pages, that was under review during the same period. This patch fixes both the issues. Fixes : 6794ad5443a2118 ("KVM: arm/arm64: Fix unintended stage 2 PMD mappings") Reported-by: Zenghui Yu <yuzenghui@huawei.com> Cc: Zenghui Yu <yuzenghui@huawei.com> Cc: Christoffer Dall <christoffer.dall@arm.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * | KVM: arm/arm64: vgic-its: Take the srcu lock when parsing the memslotsMarc Zyngier2019-03-191-3/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Calling kvm_is_visible_gfn() implies that we're parsing the memslots, and doing this without the srcu lock is frown upon: [12704.164532] ============================= [12704.164544] WARNING: suspicious RCU usage [12704.164560] 5.1.0-rc1-00008-g600025238f51-dirty #16 Tainted: G W [12704.164573] ----------------------------- [12704.164589] ./include/linux/kvm_host.h:605 suspicious rcu_dereference_check() usage! [12704.164602] other info that might help us debug this: [12704.164616] rcu_scheduler_active = 2, debug_locks = 1 [12704.164631] 6 locks held by qemu-system-aar/13968: [12704.164644] #0: 000000007ebdae4f (&kvm->lock){+.+.}, at: vgic_its_set_attr+0x244/0x3a0 [12704.164691] #1: 000000007d751022 (&its->its_lock){+.+.}, at: vgic_its_set_attr+0x250/0x3a0 [12704.164726] #2: 00000000219d2706 (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0 [12704.164761] #3: 00000000a760aecd (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0 [12704.164794] #4: 000000000ef8e31d (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0 [12704.164827] #5: 000000007a872093 (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0 [12704.164861] stack backtrace: [12704.164878] CPU: 2 PID: 13968 Comm: qemu-system-aar Tainted: G W 5.1.0-rc1-00008-g600025238f51-dirty #16 [12704.164887] Hardware name: rockchip evb_rk3399/evb_rk3399, BIOS 2019.04-rc3-00124-g2feec69fb1 03/15/2019 [12704.164896] Call trace: [12704.164910] dump_backtrace+0x0/0x138 [12704.164920] show_stack+0x24/0x30 [12704.164934] dump_stack+0xbc/0x104 [12704.164946] lockdep_rcu_suspicious+0xcc/0x110 [12704.164958] gfn_to_memslot+0x174/0x190 [12704.164969] kvm_is_visible_gfn+0x28/0x70 [12704.164980] vgic_its_check_id.isra.0+0xec/0x1e8 [12704.164991] vgic_its_save_tables_v0+0x1ac/0x330 [12704.165001] vgic_its_set_attr+0x298/0x3a0 [12704.165012] kvm_device_ioctl_attr+0x9c/0xd8 [12704.165022] kvm_device_ioctl+0x8c/0xf8 [12704.165035] do_vfs_ioctl+0xc8/0x960 [12704.165045] ksys_ioctl+0x8c/0xa0 [12704.165055] __arm64_sys_ioctl+0x28/0x38 [12704.165067] el0_svc_common+0xd8/0x138 [12704.165078] el0_svc_handler+0x38/0x78 [12704.165089] el0_svc+0x8/0xc Make sure the lock is taken when doing this. Fixes: bf308242ab98 ("KVM: arm/arm64: VGIC/ITS: protect kvm_read_guest() calls with SRCU lock") Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * | KVM: arm/arm64: vgic-its: Take the srcu lock when writing to guest memoryMarc Zyngier2019-03-194-6/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When halting a guest, QEMU flushes the virtual ITS caches, which amounts to writing to the various tables that the guest has allocated. When doing this, we fail to take the srcu lock, and the kernel shouts loudly if running a lockdep kernel: [ 69.680416] ============================= [ 69.680819] WARNING: suspicious RCU usage [ 69.681526] 5.1.0-rc1-00008-g600025238f51-dirty #18 Not tainted [ 69.682096] ----------------------------- [ 69.682501] ./include/linux/kvm_host.h:605 suspicious rcu_dereference_check() usage! [ 69.683225] [ 69.683225] other info that might help us debug this: [ 69.683225] [ 69.683975] [ 69.683975] rcu_scheduler_active = 2, debug_locks = 1 [ 69.684598] 6 locks held by qemu-system-aar/4097: [ 69.685059] #0: 0000000034196013 (&kvm->lock){+.+.}, at: vgic_its_set_attr+0x244/0x3a0 [ 69.686087] #1: 00000000f2ed935e (&its->its_lock){+.+.}, at: vgic_its_set_attr+0x250/0x3a0 [ 69.686919] #2: 000000005e71ea54 (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0 [ 69.687698] #3: 00000000c17e548d (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0 [ 69.688475] #4: 00000000ba386017 (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0 [ 69.689978] #5: 00000000c2c3c335 (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0 [ 69.690729] [ 69.690729] stack backtrace: [ 69.691151] CPU: 2 PID: 4097 Comm: qemu-system-aar Not tainted 5.1.0-rc1-00008-g600025238f51-dirty #18 [ 69.691984] Hardware name: rockchip evb_rk3399/evb_rk3399, BIOS 2019.04-rc3-00124-g2feec69fb1 03/15/2019 [ 69.692831] Call trace: [ 69.694072] lockdep_rcu_suspicious+0xcc/0x110 [ 69.694490] gfn_to_memslot+0x174/0x190 [ 69.694853] kvm_write_guest+0x50/0xb0 [ 69.695209] vgic_its_save_tables_v0+0x248/0x330 [ 69.695639] vgic_its_set_attr+0x298/0x3a0 [ 69.696024] kvm_device_ioctl_attr+0x9c/0xd8 [ 69.696424] kvm_device_ioctl+0x8c/0xf8 [ 69.696788] do_vfs_ioctl+0xc8/0x960 [ 69.697128] ksys_ioctl+0x8c/0xa0 [ 69.697445] __arm64_sys_ioctl+0x28/0x38 [ 69.697817] el0_svc_common+0xd8/0x138 [ 69.698173] el0_svc_handler+0x38/0x78 [ 69.698528] el0_svc+0x8/0xc The fix is to obviously take the srcu lock, just like we do on the read side of things since bf308242ab98. One wonders why this wasn't fixed at the same time, but hey... Fixes: bf308242ab98 ("KVM: arm/arm64: VGIC/ITS: protect kvm_read_guest() calls with SRCU lock") Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * | arm64: KVM: Always set ICH_HCR_EL2.EN if GICv4 is enabledMarc Zyngier2019-03-192-6/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The normal interrupt flow is not to enable the vgic when no virtual interrupt is to be injected (i.e. the LRs are empty). But when a guest is likely to use GICv4 for LPIs, we absolutely need to switch it on at all times. Otherwise, VLPIs only get delivered when there is something in the LRs, which doesn't happen very often. Reported-by: Nianyao Tang <tangnianyao@huawei.com> Tested-by: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * | KVM: arm64: Reset the PMU in preemptible contextMarc Zyngier2019-03-191-3/+3
| | |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We've become very cautious to now always reset the vcpu when nothing is loaded on the physical CPU. To do so, we now disable preemption and do a kvm_arch_vcpu_put() to make sure we have all the state in memory (and that it won't be loaded behind out back). This now causes issues with resetting the PMU, which calls into perf. Perf itself uses mutexes, which clashes with the lack of preemption. It is worth realizing that the PMU is fully emulated, and that no PMU state is ever loaded on the physical CPU. This means we can perfectly reset the PMU outside of the non-preemptible section. Fixes: e761a927bc9a ("KVM: arm/arm64: Reset the VCPU without preemption and vcpu state loaded") Reported-by: Julien Grall <julien.grall@arm.com> Tested-by: Julien Grall <julien.grall@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| * | Documentation: kvm: clarify KVM_SET_USER_MEMORY_REGIONPaolo Bonzini2019-03-281-8/+10
| | | | | | | | | | | | | | | | | | | | | | | | The documentation does not mention how to delete a slot, add the information. Reported-by: Nathaniel McCallum <npmccallum@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: doc: Document the life cycle of a VM and its resourcesSean Christopherson2019-03-281-0/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The series to add memcg accounting to KVM allocations[1] states: There are many KVM kernel memory allocations which are tied to the life of the VM process and should be charged to the VM process's cgroup. While it is correct to account KVM kernel allocations to the cgroup of the process that created the VM, it's technically incorrect to state that the KVM kernel memory allocations are tied to the life of the VM process. This is because the VM itself, i.e. struct kvm, is not tied to the life of the process which created it, rather it is tied to the life of its associated file descriptor. In other words, kvm_destroy_vm() is not invoked until fput() decrements its associated file's refcount to zero. A simple example is to fork() in Qemu and have the child sleep indefinitely; kvm_destroy_vm() isn't called until Qemu closes its file descriptor *and* the rogue child is killed. The allocations are guaranteed to be *accounted* to the process which created the VM, but only because KVM's per-{VM,vCPU} ioctls reject the ioctl() with -EIO if kvm->mm != current->mm. I.e. the child can keep the VM "alive" but can't do anything useful with its reference. Note that because 'struct kvm' also holds a reference to the mm_struct of its owner, the above behavior also applies to userspace allocations. Given that mucking with a VM's file descriptor can lead to subtle and undesirable behavior, e.g. memcg charges persisting after a VM is shut down, explicitly document a VM's lifecycle and its impact on the VM's resources. Alternatively, KVM could aggressively free resources when the creating process exits, e.g. via mmu_notifier->release(). However, mmu_notifier isn't guaranteed to be available, and freeing resources when the creator exits is likely to be error prone and fragile as KVM would need to ensure that it only freed resources that are truly out of reach. In practice, the existing behavior shouldn't be problematic as a properly configured system will prevent a child process from being moved out of the appropriate cgroup hierarchy, i.e. prevent hiding the process from the OOM killer, and will prevent an unprivileged user from being able to to hold a reference to struct kvm via another method, e.g. debugfs. [1]https://patchwork.kernel.org/patch/10806707/ Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>