path: root/include
Commit message | Author | Age | Files | Lines
* printk: add and use LOGLEVEL_<level> defines for KERN_<LEVEL> equivalentsJoe Perches2014-12-111-0/+13
| | | | | | | | | | Use #defines instead of magic values. Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jason Baron <jbaron@akamai.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
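For reference, the defines this commit adds look roughly like the following sketch (numeric values mirror the existing KERN_<LEVEL> strings; the comment text is paraphrased):

    /* integer equivalents of KERN_<LEVEL>, usable wherever an int loglevel is expected */
    #define LOGLEVEL_SCHED    -2    /* deferred printk from scheduler context */
    #define LOGLEVEL_DEFAULT  -1    /* default (or last) loglevel */
    #define LOGLEVEL_EMERG     0    /* system is unusable */
    #define LOGLEVEL_ALERT     1    /* action must be taken immediately */
    #define LOGLEVEL_CRIT      2    /* critical conditions */
    #define LOGLEVEL_ERR       3    /* error conditions */
    #define LOGLEVEL_WARNING   4    /* warning conditions */
    #define LOGLEVEL_NOTICE    5    /* normal but significant condition */
    #define LOGLEVEL_INFO      6    /* informational */
    #define LOGLEVEL_DEBUG     7    /* debug-level messages */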
* printk: remove used-once early_vprintkJoe Perches2014-12-111-1/+0
| | | | | | | | | | | | | | | | | Eliminate the unlikely possibility of message interleaving for early_printk/early_vprintk use. early_vprintk can be done via the %pV extension so remove this unnecessary function and change early_printk to have the equivalent vprintk code. All uses of early_printk already end with a newline so also remove the unnecessary newline from the early_printk function. Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Chris Metcalf <cmetcalf@tilera.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* kernel: add panic_on_warnPrarit Bhargava2014-12-112-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There have been several times where I have had to rebuild a kernel to cause a panic when hitting a WARN() in the code in order to get a crash dump from a system. Sometimes this is easy to do, other times (such as in the case of a remote admin) it is not trivial to send new images to the user. A much easier method would be a switch to change the WARN() over to a panic. This makes debugging easier in that I can now test the actual image the WARN() was seen on and I do not have to engage in remote debugging. This patch adds a panic_on_warn kernel parameter and /proc/sys/kernel/panic_on_warn calls panic() in the warn_slowpath_common() path. The function will still print out the location of the warning. An example of the panic_on_warn output: The first line below is from the WARN_ON() to output the WARN_ON()'s location. After that the panic() output is displayed. WARNING: CPU: 30 PID: 11698 at /home/prarit/dummy_module/dummy-module.c:25 init_dummy+0x1f/0x30 [dummy_module]() Kernel panic - not syncing: panic_on_warn set ... CPU: 30 PID: 11698 Comm: insmod Tainted: G W OE 3.17.0+ #57 Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.00.29.D696.1311111329 11/11/2013 0000000000000000 000000008e3f87df ffff88080f093c38 ffffffff81665190 0000000000000000 ffffffff818aea3d ffff88080f093cb8 ffffffff8165e2ec ffffffff00000008 ffff88080f093cc8 ffff88080f093c68 000000008e3f87df Call Trace: [<ffffffff81665190>] dump_stack+0x46/0x58 [<ffffffff8165e2ec>] panic+0xd0/0x204 [<ffffffffa038e05f>] ? init_dummy+0x1f/0x30 [dummy_module] [<ffffffff81076b90>] warn_slowpath_common+0xd0/0xd0 [<ffffffffa038e040>] ? dummy_greetings+0x40/0x40 [dummy_module] [<ffffffff81076c8a>] warn_slowpath_null+0x1a/0x20 [<ffffffffa038e05f>] init_dummy+0x1f/0x30 [dummy_module] [<ffffffff81002144>] do_one_initcall+0xd4/0x210 [<ffffffff811b52c2>] ? __vunmap+0xc2/0x110 [<ffffffff810f8889>] load_module+0x16a9/0x1b30 [<ffffffff810f3d30>] ? store_uevent+0x70/0x70 [<ffffffff810f49b9>] ? copy_module_from_fd.isra.44+0x129/0x180 [<ffffffff810f8ec6>] SyS_finit_module+0xa6/0xd0 [<ffffffff8166cf29>] system_call_fastpath+0x12/0x17 Successfully tested by me. hpa said: There is another very valid use for this: many operators would rather a machine shuts down than being potentially compromised either functionally or security-wise. Signed-off-by: Prarit Bhargava <prarit@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Acked-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Fabian Frederick <fabf@skynet.be> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
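The mechanism itself is small; a minimal sketch of the check added to the warn_slowpath_common() path (exact placement and message text assumed from the example output above):

    if (panic_on_warn) {
        /*
         * Clear the flag first so that a WARN() hit while
         * panicking cannot recurse back into panic().
         */
        panic_on_warn = 0;
        panic("panic_on_warn set ...\n");
    }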
* include/linux/file.h: remove get_unused_fd() macroYann Droneaud2014-12-111-1/+0
Macro get_unused_fd() is used to allocate a file descriptor with default flags. Those default flags (0) don't enable close-on-exec. This can be seen as an unsafe default: in most cases close-on-exec should be enabled so that file descriptors are not leaked across exec(). It would be better to have a "safer" default set of flags, e.g. O_CLOEXEC must be used to enable close-on-exec. Instead, this patch removes get_unused_fd() so that out-of-tree modules won't be affected by a runtime behavior change which might introduce other kinds of bugs: it's better to catch the change at build time, making it easier to fix. Removing the macro will also promote use of get_unused_fd_flags() (or anon_inode_getfd()) with flags provided by userspace. Or, if flags cannot be given by userspace, with flags set to O_CLOEXEC by default. Signed-off-by: Yann Droneaud <ydroneaud@opteya.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
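A minimal sketch of the pattern callers are expected to use instead of the removed macro (error handling trimmed; `file` stands for whatever struct file the caller has created):

    int fd = get_unused_fd_flags(O_CLOEXEC);
    if (fd < 0)
        return fd;
    fd_install(fd, file);    /* publish the file in the fd table */
    return fd;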
* exit: ptrace: shift "reap dead" code from exit_ptrace() to forget_original_parent()Oleg Nesterov2014-12-111-1/+1
Now that forget_original_parent() uses ->ptrace_entry for EXIT_DEAD tasks, we can simply pass the "dead_children" list to exit_ptrace() and remove another release_task() loop. Plus this way we do not need to drop and reacquire tasklist_lock. Also shift the list_empty(ptraced) check: if we want this optimization, it makes sense to eliminate the function call altogether. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Sterling Alexander <stalexan@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roland McGrath <roland@hack.frob.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: move page->mem_cgroup bad page handling into generic codeJohannes Weiner2014-12-111-17/+0
| | | | | | | | | | | | | | | | Now that the external page_cgroup data structure and its lookup is gone, let the generic bad_page() check for page->mem_cgroup sanity. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: David S. Miller <davem@davemloft.net> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Tejun Heo <tj@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: page_cgroup: rename file to mm/swap_cgroup.cJohannes Weiner2014-12-111-3/+5
| | | | | | | | | | | | | | | | | | Now that the external page_cgroup data structure and its lookup is gone, the only code remaining in there is swap slot accounting. Rename it and move the conditional compilation into mm/Makefile. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Tejun Heo <tj@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: embed the memcg pointer directly into struct pageJohannes Weiner2014-12-114-70/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Memory cgroups used to have 5 per-page pointers. To allow users to disable that amount of overhead during runtime, those pointers were allocated in a separate array, with a translation layer between them and struct page. There is now only one page pointer remaining: the memcg pointer, that indicates which cgroup the page is associated with when charged. The complexity of runtime allocation and the runtime translation overhead is no longer justified to save that *potential* 0.19% of memory. With CONFIG_SLUB, page->mem_cgroup actually sits in the doubleword padding after the page->private member and doesn't even increase struct page, and then this patch actually saves space. Remaining users that care can still compile their kernels without CONFIG_MEMCG. text data bss dec hex filename 8828345 1725264 983040 11536649 b00909 vmlinux.old 8827425 1725264 966656 11519345 afc571 vmlinux.new [mhocko@suse.cz: update Documentation/cgroups/memory.txt] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Tejun Heo <tj@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, memcg: fix potential undefined behaviour in page stat accountingMichal Hocko2014-12-111-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since commit d7365e783edb ("mm: memcontrol: fix missed end-writeback page accounting") mem_cgroup_end_page_stat consumes locked and flags variables directly rather than via pointers which might trigger C undefined behavior as those variables are initialized only in the slow path of mem_cgroup_begin_page_stat. Although mem_cgroup_end_page_stat handles parameters correctly and touches them only when they hold a sensible value it is caller which loads a potentially uninitialized value which then might allow compiler to do crazy things. I haven't seen any warning from gcc and it seems that the current version (4.9) doesn't exploit this type undefined behavior but Sasha has reported the following: UBSan: Undefined behaviour in mm/rmap.c:1084:2 load of value 255 is not a valid value for type '_Bool' CPU: 4 PID: 8304 Comm: rngd Not tainted 3.18.0-rc2-next-20141029-sasha-00039-g77ed13d-dirty #1427 Call Trace: dump_stack (lib/dump_stack.c:52) ubsan_epilogue (lib/ubsan.c:159) __ubsan_handle_load_invalid_value (lib/ubsan.c:482) page_remove_rmap (mm/rmap.c:1084 mm/rmap.c:1096) unmap_page_range (./arch/x86/include/asm/atomic.h:27 include/linux/mm.h:463 mm/memory.c:1146 mm/memory.c:1258 mm/memory.c:1279 mm/memory.c:1303) unmap_single_vma (mm/memory.c:1348) unmap_vmas (mm/memory.c:1377 (discriminator 3)) exit_mmap (mm/mmap.c:2837) mmput (kernel/fork.c:659) do_exit (./arch/x86/include/asm/thread_info.h:168 kernel/exit.c:462 kernel/exit.c:747) do_group_exit (include/linux/sched.h:775 kernel/exit.c:873) SyS_exit_group (kernel/exit.c:901) tracesys_phase2 (arch/x86/kernel/entry_64.S:529) Fix this by using pointer parameters for both locked and flags and be more robust for future compiler changes even though the current code is implemented correctly. Signed-off-by: Michal Hocko <mhocko@suse.cz> Reported-by: Sasha Levin <sasha.levin@oracle.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: memcontrol: drop bogus RCU locking from mem_cgroup_same_or_subtree()Johannes Weiner2014-12-111-7/+6
| | | | | | | | | | | | | | | None of the mem_cgroup_same_or_subtree() callers actually require it to take the RCU lock, either because they hold it themselves or they have css references. Remove it. To make the API change clear, rename the leftover helper to mem_cgroup_is_descendant() to match cgroup_is_descendant(). Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: memcontrol: pull the NULL check from __mem_cgroup_same_or_subtree()Johannes Weiner2014-12-111-2/+3
| | | | | | | | | | | | | | The NULL in mm_match_cgroup() comes from a possibly exiting mm->owner. It makes a lot more sense to check where it's looked up, rather than check for it in __mem_cgroup_same_or_subtree() where it's unexpected. No other callsite passes NULL to __mem_cgroup_same_or_subtree(). Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: use generic slab iterators for showing slabinfoVladimir Davydov2014-12-111-4/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | Let's use generic slab_start/next/stop for showing memcg caches info. In contrast to the current implementation, this will work even if all memcg caches' info doesn't fit into a seq buffer (a page), plus it simply looks neater. Actually, the main reason I do this isn't mere cleanup. I'm going to zap the memcg_slab_caches list, because I find it useless provided we have the slab_caches list, and this patch is a step in this direction. It should be noted that before this patch an attempt to read memory.kmem.slabinfo of a cgroup that doesn't have kmem limit set resulted in -EIO, while after this patch it will silently show nothing except the header, but I don't think it will frustrate anyone. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, hugetlb: correct bit shift in hstate_sizelog()Sasha Levin2014-12-111-1/+2
| | | | | | | | | | | | | hstate_sizelog() would shift left an int rather than long, triggering undefined behaviour and passing an incorrect value when the requested page size was more than 4GB, thus breaking >4GB pages. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andrey Ryabinin <a.ryabinin@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
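The bug and its fix fit in one expression; a sketch of the corrected helper (surrounding function body assumed from context, the 1UL literal is the point of the patch):

    static inline struct hstate *hstate_sizelog(int page_size_log)
    {
        if (!page_size_log)
            return &default_hstate;

        /* 1UL instead of 1: shifting a 32-bit int by >= 32 bits is undefined */
        return size_to_hstate(1UL << page_size_log);
    }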
* mm: memcontrol: remove unnecessary PCG_USED pc->mem_cgroup valid flagJohannes Weiner2014-12-111-10/+0
| | | | | | | | | | | | | | | | | | | | pc->mem_cgroup had to be left intact after uncharge for the final LRU removal, and !PCG_USED indicated whether the page was uncharged. But since commit 0a31bc97c80c ("mm: memcontrol: rewrite uncharge API") pages are uncharged after the final LRU removal. Uncharge can simply clear the pointer and the PCG_USED/PageCgroupUsed sites can test that instead. Because this is the last page_cgroup flag, this patch reduces the memcg per-page overhead to a single pointer. [akpm@linux-foundation.org: remove unneeded initialization of `memcg', per Michal] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: memcontrol: remove unnecessary PCG_MEM memory charge flagJohannes Weiner2014-12-111-1/+0
| | | | | | | | | | | | | | | | PCG_MEM is a remnant from an earlier version of 0a31bc97c80c ("mm: memcontrol: rewrite uncharge API"), used to tell whether migration cleared a charge while leaving pc->mem_cgroup valid and PCG_USED set. But in the final version, mem_cgroup_migrate() directly uncharges the source page, rendering this distinction unnecessary. Remove it. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: memcontrol: remove unnecessary PCG_MEMSW memory+swap charge flagJohannes Weiner2014-12-111-1/+0
| | | | | | | | | | | | | | | Now that mem_cgroup_swapout() fully uncharges the page, every page that is still in use when reaching mem_cgroup_uncharge() is known to carry both the memory and the memory+swap charge. Simplify the uncharge path and remove the PCG_MEMSW page flag accordingly. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, compaction: simplify deferred compactionVlastimil Babka2014-12-111-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since commit 53853e2d2bfb ("mm, compaction: defer each zone individually instead of preferred zone"), compaction is deferred for each zone where sync direct compaction fails, and reset where it succeeds. However, it was observed that for DMA zone compaction often appeared to succeed while subsequent allocation attempt would not, due to different outcome of watermark check. In order to properly defer compaction in this zone, the candidate zone has to be passed back to __alloc_pages_direct_compact() and compaction deferred in the zone after the allocation attempt fails. The large source of mismatch between watermark check in compaction and allocation was the lack of alloc_flags and classzone_idx values in compaction, which has been fixed in the previous patch. So with this problem fixed, we can simplify the code by removing the candidate_zone parameter and deferring in __alloc_pages_direct_compact(). After this patch, the compaction activity during stress-highalloc benchmark is still somewhat increased, but it's negligible compared to the increase that occurred without the better watermark checking. This suggests that it is still possible to apparently succeed in compaction but fail to allocate, possibly due to parallel allocation activity. [akpm@linux-foundation.org: fix build] Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Christoph Lameter <cl@linux.com> Cc: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, compaction: pass classzone_idx and alloc_flags to watermark checkingVlastimil Babka2014-12-111-2/+6
Compaction relies on zone watermark checks for decisions such as whether it's worth starting to compact in compaction_suitable() or whether compaction should stop in compact_finished(). The watermark checks take classzone_idx and alloc_flags parameters, which are related to the memory allocation request. But from the context of compaction they are currently passed as 0, including for direct compaction, which is invoked to satisfy the allocation request and could therefore know the proper values. The lack of proper values can lead to a mismatch between decisions taken during compaction and decisions related to the allocation request. A missing classzone_idx value means that lowmem_reserve is not taken into account. This has manifested (during recent changes to deferred compaction) when the DMA zone was used as fallback for the preferred Normal zone. compaction_suitable() without a proper classzone_idx would think that the watermarks are already satisfied, but the watermark check in get_page_from_freelist() would fail. Because of this problem, deferring compaction has extra complexity that can be removed in the following patch. The issue (not confirmed in practice) with missing alloc_flags is opposite in nature. For allocations that include ALLOC_HIGH, ALLOC_HARDER or ALLOC_CMA in alloc_flags (the last includes all MOVABLE allocations on CMA-enabled systems) the watermark checking in compaction with 0 passed will be stricter than in get_page_from_freelist(). In these cases compaction might be running for a longer time than is really needed. Another issue with compaction_suitable() is that the check for "does the zone need compaction at all?" comes only after the check "does the zone have enough free pages to succeed compaction". The latter considers extra pages for migration and can therefore in some situations fail and return COMPACT_SKIPPED, although the high-order allocation would succeed and we should return COMPACT_PARTIAL. This patch fixes these problems by adding alloc_flags and classzone_idx to struct compact_control and related functions involved in direct compaction and watermark checking. Where possible, all other callers of compaction_suitable() pass proper values where those are known. This is currently limited to classzone_idx, which is sometimes known in kswapd context. However, the direct reclaim callers should_continue_reclaim() and compaction_ready() do not currently know the proper values, so the coordination between reclaim and compaction may still not be as accurate as it could be. This can be fixed later, if it's shown to be an issue. Additionally the checks in compaction_suitable() are reordered to address the second issue described above. The effect of this patch should be slightly better high-order allocation success rates and/or less compaction overhead, depending on the type of allocations and presence of CMA. It allows simplifying deferred compaction code in a followup patch. When testing with stress-highalloc, there was some slight improvement (which might be just due to variance) in success rates of non-THP-like allocations. 
Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Christoph Lameter <cl@linux.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: introduce single zone pcplists drainVlastimil Babka2014-12-111-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The functions for draining per-cpu pages back to buddy allocators currently always operate on all zones. There are however several cases where the drain is only needed in the context of a single zone, and spilling other pcplists is a waste of time both due to the extra spilling and later refilling. This patch introduces new zone pointer parameter to drain_all_pages() and changes the dummy parameter of drain_local_pages() to be also a zone pointer. When NULL is passed, the functions operate on all zones as usual. Passing a specific zone pointer reduces the work to the single zone. All callers are updated to pass the NULL pointer in this patch. Conversion to single zone (where appropriate) is done in further patches. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
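After this patch the two declarations look roughly like this (a sketch; passing NULL keeps the old all-zones behaviour):

    void drain_all_pages(struct zone *zone);    /* NULL means every populated zone */
    void drain_local_pages(struct zone *zone);  /* drain this CPU's pcplists only */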
* mm: memcontrol: remove obsolete kmemcg pinning tricksJohannes Weiner2014-12-111-2/+2
As charges now pin the css explicitly, there is no more need for kmemcg to acquire a proxy reference for outstanding pages during offlining, or to maintain state to identify such "dead" groups. This was the last user of the uncharge functions' return values, so remove them as well. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: memcontrol: take a css reference for each charged pageJohannes Weiner2014-12-112-9/+64
Charges currently pin the css indirectly by playing tricks during css_offline(): user pages stall the offlining process until all of them have been reparented, whereas kmemcg acquires a keep-alive reference if outstanding kernel pages are detected at that point. In preparation for removing all this complexity, make the pinning explicit and acquire a css reference for every charged page. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* kernel: res_counter: remove the unused APIJohannes Weiner2014-12-111-223/+0
| | | | | | | | | | | | | | | | | All memory accounting and limiting has been switched over to the lockless page counters. Bye, res_counter! [akpm@linux-foundation.org: update Documentation/cgroups/memory.txt] [mhocko@suse.cz: ditch the last remainings of res_counter] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Paul Bolle <pebolle@tiscali.nl> Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: hugetlb_cgroup: convert to lockless page countersJohannes Weiner2014-12-111-1/+0
| | | | | | | | | | | | | Abandon the spinlock-protected byte counters in favor of the unlocked page counters in the hugetlb controller as well. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: memcontrol: lockless page countersJohannes Weiner2014-12-113-20/+62
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Memory is internally accounted in bytes, using spinlock-protected 64-bit counters, even though the smallest accounting delta is a page. The counter interface is also convoluted and does too many things. Introduce a new lockless word-sized page counter API, then change all memory accounting over to it. The translation from and to bytes then only happens when interfacing with userspace. The removed locking overhead is noticable when scaling beyond the per-cpu charge caches - on a 4-socket machine with 144-threads, the following test shows the performance differences of 288 memcgs concurrently running a page fault benchmark: vanilla: 18631648.500498 task-clock (msec) # 140.643 CPUs utilized ( +- 0.33% ) 1,380,638 context-switches # 0.074 K/sec ( +- 0.75% ) 24,390 cpu-migrations # 0.001 K/sec ( +- 8.44% ) 1,843,305,768 page-faults # 0.099 M/sec ( +- 0.00% ) 50,134,994,088,218 cycles # 2.691 GHz ( +- 0.33% ) <not supported> stalled-cycles-frontend <not supported> stalled-cycles-backend 8,049,712,224,651 instructions # 0.16 insns per cycle ( +- 0.04% ) 1,586,970,584,979 branches # 85.176 M/sec ( +- 0.05% ) 1,724,989,949 branch-misses # 0.11% of all branches ( +- 0.48% ) 132.474343877 seconds time elapsed ( +- 0.21% ) lockless: 12195979.037525 task-clock (msec) # 133.480 CPUs utilized ( +- 0.18% ) 832,850 context-switches # 0.068 K/sec ( +- 0.54% ) 15,624 cpu-migrations # 0.001 K/sec ( +- 10.17% ) 1,843,304,774 page-faults # 0.151 M/sec ( +- 0.00% ) 32,811,216,801,141 cycles # 2.690 GHz ( +- 0.18% ) <not supported> stalled-cycles-frontend <not supported> stalled-cycles-backend 9,999,265,091,727 instructions # 0.30 insns per cycle ( +- 0.10% ) 2,076,759,325,203 branches # 170.282 M/sec ( +- 0.12% ) 1,656,917,214 branch-misses # 0.08% of all branches ( +- 0.55% ) 91.369330729 seconds time elapsed ( +- 0.45% ) On top of improved scalability, this also gets rid of the icky long long types in the very heart of memcg, which is great for 32 bit and also makes the code a lot more readable. Notable differences between the old and new API: - res_counter_charge() and res_counter_charge_nofail() become page_counter_try_charge() and page_counter_charge() resp. to match the more common kernel naming scheme of try_do()/do() - res_counter_uncharge_until() is only ever used to cancel a local counter and never to uncharge bigger segments of a hierarchy, so it's replaced by the simpler page_counter_cancel() - res_counter_set_limit() is replaced by page_counter_limit(), which expects its callers to serialize against themselves - res_counter_memparse_write_strategy() is replaced by page_counter_limit(), which rounds down to the nearest page size - rather than up. This is more reasonable for explicitely requested hard upper limits. - to keep charging light-weight, page_counter_try_charge() charges speculatively, only to roll back if the result exceeds the limit. Because of this, a failing bigger charge can temporarily lock out smaller charges that would otherwise succeed. The error is bounded to the difference between the smallest and the biggest possible charge size, so for memcg, this means that a failing THP charge can send base page charges into reclaim upto 2MB (4MB) before the limit would have been reached. This should be acceptable. 
[akpm@linux-foundation.org: add includes for WARN_ON_ONCE and memparse] [akpm@linux-foundation.org: add includes for WARN_ON_ONCE, memparse, strncmp, and PAGE_SIZE] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
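The resulting interface is roughly the following sketch (field layout and exact prototypes are assumptions based on the changelog above):

    struct page_counter {
        atomic_long_t count;           /* pages currently charged */
        unsigned long limit;           /* hard limit in pages */
        struct page_counter *parent;   /* hierarchy link */
        unsigned long watermark;       /* historical maximum */
        unsigned long failcnt;         /* failed charge attempts */
    };

    void page_counter_charge(struct page_counter *counter, unsigned long nr_pages);
    int  page_counter_try_charge(struct page_counter *counter, unsigned long nr_pages,
                                 struct page_counter **fail);
    void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
    void page_counter_cancel(struct page_counter *counter, unsigned long nr_pages);
    int  page_counter_limit(struct page_counter *counter, unsigned long limit);
    int  page_counter_memparse(const char *buf, unsigned long *nr_pages);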
* Merge branch 'timers-2038-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds2014-12-102-0/+4
Pull more 2038 timer work from Thomas Gleixner: "Two more patches for the ongoing 2038 work: - New accessors to clock MONOTONIC and REALTIME seconds. This is a separate branch as Arnd has follow-up work depending on this" * 'timers-2038-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timekeeping: Provide y2038 safe accessor to the seconds portion of CLOCK_REALTIME timekeeping: Provide fast accessor to the seconds part of CLOCK_MONOTONIC
| * timekeeping: Provide y2038 safe accessor to the seconds portion of CLOCK_REALTIMEHeena Sirwani2014-10-291-0/+1
ktime_get_real_seconds() is the replacement function for get_seconds(), returning the seconds portion of CLOCK_REALTIME in a time64_t. For 64bit the function is equivalent to get_seconds(), but for 32bit it protects the readout with the timekeeper sequence count. This is required because 32-bit machines cannot access the 64-bit tk->xtime_sec variable atomically. [tglx: Massaged changelog and added docbook comment] Signed-off-by: Heena Sirwani <heenasirwani@gmail.com> Reviewed-by: Arnd Bergman <arnd@arndb.de> Cc: John Stultz <john.stultz@linaro.org> Cc: opw-kernel@googlegroups.com Link: http://lkml.kernel.org/r/7adcfaa8962b8ad58785d9a2456c3f77d93c0ffb.1414578445.git.heenasirwani@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * timekeeping: Provide fast accessor to the seconds part of CLOCK_MONOTONICHeena Sirwani2014-10-292-0/+3
This is the counterpart to get_seconds() based on CLOCK_MONOTONIC. The use case for this interface is kernel internal coarse grained timestamps which require neither the nanoseconds fraction of current time nor the CLOCK_REALTIME properties. Such timestamps can currently only be retrieved by calling ktime_get_ts64() and using the tv_sec field of the returned timespec64. That's inefficient as it involves the read of the clocksource, math operations and must be protected by the timekeeper sequence counter. To avoid the sequence counter protection we restrict the return value to unsigned 32bit on 32bit machines. This covers ~136 years of uptime and therefore an overflow is not expected to hit anytime soon. To avoid math in the function we calculate the current seconds portion of CLOCK_MONOTONIC when the timekeeper gets updated in tk_update_ktime_data() similar to the CLOCK_REALTIME counterpart xtime_sec. [tglx: Massaged changelog, simplified and commented the update function, added docbook comment] Signed-off-by: Heena Sirwani <heenasirwani@gmail.com> Reviewed-by: Arnd Bergman <arnd@arndb.de> Cc: John Stultz <john.stultz@linaro.org> Cc: opw-kernel@googlegroups.com Link: http://lkml.kernel.org/r/da0b63f4bdf3478909f92becb35861197da3a905.1414578445.git.heenasirwani@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
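Taken together, the two commits provide roughly these accessors (a sketch; the prototypes, in particular the exact return type of the monotonic variant, are assumptions based on the changelogs):

    /* seconds portion of CLOCK_REALTIME, y2038-safe replacement for get_seconds() */
    extern time64_t ktime_get_real_seconds(void);

    /* coarse seconds portion of CLOCK_MONOTONIC, precomputed in tk_update_ktime_data() */
    extern time64_t ktime_get_seconds(void);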
* | Merge branch 'x86-mpx-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds2014-12-105-4/+38
Pull x86 MPX support from Thomas Gleixner: "This enables support for x86 MPX. MPX is a new debug feature for bound checking in user space. It requires kernel support to handle the bound tables and decode the bound violating instruction in the trap handler" * 'x86-mpx-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: asm-generic: Remove asm-generic arch_bprm_mm_init() mm: Make arch_unmap()/bprm_mm_init() available to all architectures x86: Cleanly separate use of asm-generic/mm_hooks.h x86 mpx: Change return type of get_reg_offset() fs: Do not include mpx.h in exec.c x86, mpx: Add documentation on Intel MPX x86, mpx: Cleanup unused bound tables x86, mpx: On-demand kernel allocation of bounds tables x86, mpx: Decode MPX instruction to get bound violation information x86, mpx: Add MPX-specific mmap interface x86, mpx: Introduce VM_MPX to indicate that a VMA is MPX specific x86, mpx: Add MPX to disabled features ia64: Sync struct siginfo with general version mips: Sync struct siginfo with general version mpx: Extend siginfo structure to include bound violation information x86, mpx: Rename cfg_reg_u and status_reg x86: mpx: Give bndX registers actual names x86: Remove arbitrary instruction size limit in instruction decoder
| * | asm-generic: Remove asm-generic arch_bprm_mm_init()Dave Hansen2014-11-221-5/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is a follow-on to commit 62e88b1c00de 'mm: Make arch_unmap()/bprm_mm_init() available to all architectures' I removed the asm-generic version of arch_unmap() in that patch, but missed arch_bprm_mm_init(). So this broke the build for architectures using asm-generic/mmu_context.h who actually have an MMU. Fixes: 62e88b1c00de 'mm: Make arch_unmap()/bprm_mm_init() available to all architectures' Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Hansen <dave@sr71.net> Cc: linux-arch@vger.kernel.org Cc: x86@kernel.org Link: http://lkml.kernel.org/r/20141122163711.0F037EE6@viggo.jf.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | mm: Make arch_unmap()/bprm_mm_init() available to all architecturesDave Hansen2014-11-192-9/+14
The x86 MPX patch set calls arch_unmap() and arch_bprm_mm_init() from fs/exec.c, so we need at least a stub for them in all architectures. They are only called under an #ifdef for CONFIG_MMU=y, so we can at least restrict this to architectures with MMU support. blackfin/c6x have no MMU support, so do not call arch_unmap(). They also do not include mm_hooks.h or mmu_context.h at all and do not need to be touched. s390, um and unicore32 do not use asm-generic/mm_hooks.h, so they got their own arch_unmap() versions. (I also moved um's arch_dup_mmap() to be closer to the other mm_hooks.h functions). xtensa only includes mm_hooks when MMU=y, which should be fine since arch_unmap() is called only from MMU=y code. For the rest, we use the stub copies of these functions in asm-generic/mm_hooks.h. I cross compiled defconfigs for cris (to check NOMMU) and s390 to make sure that this works. I also checked a 64-bit build of UML and all my normal x86 builds including PARAVIRT on and off. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Hansen <dave@sr71.net> Cc: linux-arch@vger.kernel.org Cc: x86@kernel.org Link: http://lkml.kernel.org/r/20141118182350.8B4AA2C2@viggo.jf.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
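The asm-generic fallbacks are empty inline stubs, roughly like the following sketch (parameter lists assumed from the callers in fs/exec.c and the munmap path):

    static inline void arch_unmap(struct mm_struct *mm,
                                  struct vm_area_struct *vma,
                                  unsigned long start, unsigned long end)
    {
    }

    static inline void arch_bprm_mm_init(struct mm_struct *mm,
                                         struct vm_area_struct *vma)
    {
    }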
| * | x86, mpx: Cleanup unused bound tablesDave Hansen2014-11-181-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The previous patch allocates bounds tables on-demand. As noted in an earlier description, these can add up to *HUGE* amounts of memory. This has caused OOMs in practice when running tests. This patch adds support for freeing bounds tables when they are no longer in use. There are two types of mappings in play when unmapping tables: 1. The mapping with the actual data, which userspace is munmap()ing or brk()ing away, etc... 2. The mapping for the bounds table *backing* the data (is tagged with VM_MPX, see the patch "add MPX specific mmap interface"). If userspace use the prctl() indroduced earlier in this patchset to enable the management of bounds tables in kernel, when it unmaps the first type of mapping with the actual data, the kernel needs to free the mapping for the bounds table backing the data. This patch hooks in at the very end of do_unmap() to do so. We look at the addresses being unmapped and find the bounds directory entries and tables which cover those addresses. If an entire table is unused, we clear associated directory entry and free the table. Once we unmap the bounds table, we would have a bounds directory entry pointing at empty address space. That address space might now be allocated for some other (random) use, and the MPX hardware might now try to walk it as if it were a bounds table. That would be bad. So any unmapping of an enture bounds table has to be accompanied by a corresponding write to the bounds directory entry to invalidate it. That write to the bounds directory can fault, which causes the following problem: Since we are doing the freeing from munmap() (and other paths like it), we hold mmap_sem for write. If we fault, the page fault handler will attempt to acquire mmap_sem for read and we will deadlock. To avoid the deadlock, we pagefault_disable() when touching the bounds directory entry and use a get_user_pages() to resolve the fault. The unmapping of bounds tables happends under vm_munmap(). We also (indirectly) call vm_munmap() to _do_ the unmapping of the bounds tables. We avoid unbounded recursion by disallowing freeing of bounds tables *for* bounds tables. This would not occur normally, so should not have any practical impact. Being strict about it here helps ensure that we do not have an exploitable stack overflow. Based-on-patch-by: Qiaowei Ren <qiaowei.ren@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: linux-mm@kvack.org Cc: linux-mips@linux-mips.org Cc: Dave Hansen <dave@sr71.net> Link: http://lkml.kernel.org/r/20141114151831.E4531C4A@viggo.jf.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | x86, mpx: On-demand kernel allocation of bounds tablesDave Hansen2014-11-183-0/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is really the meat of the MPX patch set. If there is one patch to review in the entire series, this is the one. There is a new ABI here and this kernel code also interacts with userspace memory in a relatively unusual manner. (small FAQ below). Long Description: This patch adds two prctl() commands to provide enable or disable the management of bounds tables in kernel, including on-demand kernel allocation (See the patch "on-demand kernel allocation of bounds tables") and cleanup (See the patch "cleanup unused bound tables"). Applications do not strictly need the kernel to manage bounds tables and we expect some applications to use MPX without taking advantage of this kernel support. This means the kernel can not simply infer whether an application needs bounds table management from the MPX registers. The prctl() is an explicit signal from userspace. PR_MPX_ENABLE_MANAGEMENT is meant to be a signal from userspace to require kernel's help in managing bounds tables. PR_MPX_DISABLE_MANAGEMENT is the opposite, meaning that userspace don't want kernel's help any more. With PR_MPX_DISABLE_MANAGEMENT, the kernel won't allocate and free bounds tables even if the CPU supports MPX. PR_MPX_ENABLE_MANAGEMENT will fetch the base address of the bounds directory out of a userspace register (bndcfgu) and then cache it into a new field (->bd_addr) in the 'mm_struct'. PR_MPX_DISABLE_MANAGEMENT will set "bd_addr" to an invalid address. Using this scheme, we can use "bd_addr" to determine whether the management of bounds tables in kernel is enabled. Also, the only way to access that bndcfgu register is via an xsaves, which can be expensive. Caching "bd_addr" like this also helps reduce the cost of those xsaves when doing table cleanup at munmap() time. Unfortunately, we can not apply this optimization to #BR fault time because we need an xsave to get the value of BNDSTATUS. ==== Why does the hardware even have these Bounds Tables? ==== MPX only has 4 hardware registers for storing bounds information. If MPX-enabled code needs more than these 4 registers, it needs to spill them somewhere. It has two special instructions for this which allow the bounds to be moved between the bounds registers and some new "bounds tables". They are similar conceptually to a page fault and will be raised by the MPX hardware during both bounds violations or when the tables are not present. This patch handles those #BR exceptions for not-present tables by carving the space out of the normal processes address space (essentially calling the new mmap() interface indroduced earlier in this patch set.) and then pointing the bounds-directory over to it. The tables *need* to be accessed and controlled by userspace because the instructions for moving bounds in and out of them are extremely frequent. They potentially happen every time a register pointing to memory is dereferenced. 
Any direct kernel involvement (like a syscall) to access the tables would obviously destroy performance. ==== Why not do this in userspace? ==== This patch is obviously doing this allocation in the kernel. However, MPX does not strictly *require* anything in the kernel. It can theoretically be done completely from userspace. Here are a few ways this *could* be done. I don't think any of them are practical in the real-world, but here they are. Q: Can virtual space simply be reserved for the bounds tables so that we never have to allocate them? A: As noted earlier, these tables are *HUGE*. An X-GB virtual area needs 4*X GB of virtual space, plus 2GB for the bounds directory. If we were to preallocate them for the 128TB of user virtual address space, we would need to reserve 512TB+2GB, which is larger than the entire virtual address space today. This means they can not be reserved ahead of time. Also, a single process's pre-popualated bounds directory consumes 2GB of virtual *AND* physical memory. IOW, it's completely infeasible to prepopulate bounds directories. Q: Can we preallocate bounds table space at the same time memory is allocated which might contain pointers that might eventually need bounds tables? A: This would work if we could hook the site of each and every memory allocation syscall. This can be done for small, constrained applications. But, it isn't practical at a larger scale since a given app has no way of controlling how all the parts of the app might allocate memory (think libraries). The kernel is really the only place to intercept these calls. Q: Could a bounds fault be handed to userspace and the tables allocated there in a signal handler instead of in the kernel? A: (thanks to tglx) mmap() is not on the list of safe async handler functions and even if mmap() would work it still requires locking or nasty tricks to keep track of the allocation state there. Having ruled out all of the userspace-only approaches for managing bounds tables that we could think of, we create them on demand in the kernel. Based-on-patch-by: Qiaowei Ren <qiaowei.ren@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: linux-mm@kvack.org Cc: linux-mips@linux-mips.org Cc: Dave Hansen <dave@sr71.net> Link: http://lkml.kernel.org/r/20141114151829.AD4310DE@viggo.jf.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
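From userspace, opting in to kernel-managed bounds tables is a single prctl() call; a hedged sketch (the PR_MPX_* constants are introduced by this series and must be present in the uapi headers being built against):

    #include <stdio.h>
    #include <sys/prctl.h>

    int main(void)
    {
        /* sketch: ask the kernel to allocate/free bounds tables for this process */
        if (prctl(PR_MPX_ENABLE_MANAGEMENT, 0, 0, 0, 0) != 0)
            perror("prctl(PR_MPX_ENABLE_MANAGEMENT)");
        /* ... run MPX-instrumented code ... */
        return 0;
    }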
| * | x86, mpx: Introduce VM_MPX to indicate that a VMA is MPX specificQiaowei Ren2014-11-181-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | MPX-enabled applications using large swaths of memory can potentially have large numbers of bounds tables in process address space to save bounds information. These tables can take up huge swaths of memory (as much as 80% of the memory on the system) even if we clean them up aggressively. In the worst-case scenario, the tables can be 4x the size of the data structure being tracked. IOW, a 1-page structure can require 4 bounds-table pages. Being this huge, our expectation is that folks using MPX are going to be keen on figuring out how much memory is being dedicated to it. So we need a way to track memory use for MPX. If we want to specifically track MPX VMAs we need to be able to distinguish them from normal VMAs, and keep them from getting merged with normal VMAs. A new VM_ flag set only on MPX VMAs does both of those things. With this flag, MPX bounds-table VMAs can be distinguished from other VMAs, and userspace can also walk /proc/$pid/smaps to get memory usage for MPX. In addition to this flag, we also introduce a special ->vm_ops specific to MPX VMAs (see the patch "add MPX specific mmap interface"), but currently different ->vm_ops do not by themselves prevent VMA merging, so we still need this flag. We understand that VM_ flags are scarce and are open to other options. Signed-off-by: Qiaowei Ren <qiaowei.ren@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: linux-mm@kvack.org Cc: linux-mips@linux-mips.org Cc: Dave Hansen <dave@sr71.net> Link: http://lkml.kernel.org/r/20141114151825.565625B3@viggo.jf.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | mpx: Extend siginfo structure to include bound violation informationQiaowei Ren2014-11-181-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds new fields about bound violation into siginfo structure. si_lower and si_upper are respectively lower bound and upper bound when bound violation is caused. Signed-off-by: Qiaowei Ren <qiaowei.ren@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: linux-mm@kvack.org Cc: linux-mips@linux-mips.org Cc: Dave Hansen <dave@sr71.net> Link: http://lkml.kernel.org/r/20141114151819.1908C900@viggo.jf.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* | | Merge branch 'irq-irqdomain-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds2014-12-107-23/+358
Pull irq domain updates from Thomas Gleixner: "The real interesting irq updates: - Support for hierarchical irq domains: For complex interrupt routing scenarios where more than one interrupt related chip is involved we had no proper representation in the generic interrupt infrastructure so far. That made people implement rather ugly constructs in their nested irq chip implementations. The main offenders are x86 and arm/gic. To disentangle that mess we now have hierarchical irqdomains which separate the various interrupt chips and connect them via the hierarchical domains. That keeps the domain specific details internal to the particular hierarchy level and removes the criss-cross referencing of chip internals. The resulting hierarchy for a complex x86 system will look like this: vector mapped: 74 msi-0 mapped: 2 dmar-ir-1 mapped: 69 ioapic-1 mapped: 4 ioapic-0 mapped: 20 pci-msi-2 mapped: 45 dmar-ir-0 mapped: 3 ioapic-2 mapped: 1 pci-msi-1 mapped: 2 htirq mapped: 0 Neither ioapic nor pci-msi know about the dmar interrupt remapping between themselves and the vector domain. If interrupt remapping is disabled ioapic and pci-msi become direct children of the vector domain. In hindsight we should have done that years ago, but in hindsight we always know better :) - Support for generic MSI interrupt domain handling: We have more and more non-PCI related MSI interrupts, so providing a generic infrastructure for this is better than having all affected architectures implementing their own private hacks. - Support for PCI-MSI interrupt domain handling, based on the generic MSI support. This part carries the pci/msi branch from Bjorn Helgaas' pci tree to avoid a massive conflict. The PCI/MSI parts are acked by Bjorn. I have two more branches on top of this. 
The full conversion of x86 to hierarchical domains and a partial conversion of arm/gic" * 'irq-irqdomain-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits) genirq: Move irq_chip_write_msi_msg() helper to core PCI/MSI: Allow an msi_controller to be associated to an irq domain PCI/MSI: Provide mechanism to alloc/free MSI/MSIX interrupt from irqdomain PCI/MSI: Enhance core to support hierarchy irqdomain PCI/MSI: Move cached entry functions to irq core genirq: Provide default callbacks for msi_domain_ops genirq: Introduce msi_domain_alloc/free_irqs() asm-generic: Add msi.h genirq: Add generic msi irq domain support genirq: Introduce callback irq_chip.irq_write_msi_msg genirq: Work around __irq_set_handler vs stacked domains ordering issues irqdomain: Introduce helper function irq_domain_add_hierarchy() irqdomain: Implement a method to automatically call parent domains alloc/free genirq: Introduce helper irq_domain_set_info() to reduce duplicated code genirq: Split out flow handler typedefs into seperate header file genirq: Add IRQ_SET_MASK_OK_DONE to support stacked irqchip genirq: Introduce irq_chip.irq_compose_msi_msg() to support stacked irqchip genirq: Add more helper functions to support stacked irq_chip genirq: Introduce helper functions to support stacked irq_chip irqdomain: Do irq_find_mapping and set_type for hierarchy irqdomain in case OF ...
| * | | genirq: Move irq_chip_write_msi_msg() helper to coreThomas Gleixner2014-12-071-6/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | No point to expose this to the world. The only legitimate user is the core code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Jiang Liu <jiang.liu@linux.intel.com> Cc: Marc Zyngier <marc.zyngier@arm.com>
| * | | PCI/MSI: Allow an msi_controller to be associated to an irq domainMarc Zyngier2014-11-231-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With the new stacked irq domains, it becomes pretty tempting to allocate an MSI domain per PCI bus, which would remove the requirement of either relying on arch-specific code, or a default PCI MSI domain. By allowing the msi_controller structure to carry a pointer to an irq_domain, we can easily use this in pci_msi_setup_msi_irqs. The existing code can still be used as a fallback if the MSI driver does not populate the domain field. Tested on arm64 with the GICv3 ITS driver. Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Cc: Yingjoe Chen <yingjoe.chen@mediatek.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: linux-arm-kernel@lists.infradead.org Cc: Jiang Liu <jiang.liu@linux.intel.com> Link: http://lkml.kernel.org/r/1416048553-29289-2-git-send-email-marc.zyngier@arm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | PCI/MSI: Provide mechanism to alloc/free MSI/MSIX interrupt from irqdomainJiang Liu2014-11-231-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Provide mechanism to directly alloc/free MSI/MSIX interrupt from irqdomain, which will be used to replace arch_setup_msi_irq()/ arch_setup_msi_irqs()/arch_teardown_msi_irq()/arch_teardown_msi_irqs(). To kill weak functions, this patch introduce a new weak function arch_get_pci_msi_domain(), which is to retrieve the MSI irqdomain for a PCI device. This weak function could be killed once we get a common way to associate MSI domain with PCI device. Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-arm-kernel@lists.infradead.org Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Grant Likely <grant.likely@linaro.org> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Yijing Wang <wangyijing@huawei.com> Cc: Yingjoe Chen <yingjoe.chen@mediatek.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Matthias Brugger <matthias.bgg@gmail.com> Cc: Alexander Gordeev <agordeev@redhat.com> Link: http://lkml.kernel.org/r/1416061447-9472-10-git-send-email-jiang.liu@linux.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | PCI/MSI: Enhance core to support hierarchy irqdomainJiang Liu2014-11-231-0/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Enhance PCI MSI core to support hierarchy irqdomain, so the common code can be shared across architectures. [ tglx: Extracted and combined from several patches ] Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Grant Likely <grant.likely@linaro.org> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Yingjoe Chen <yingjoe.chen@mediatek.com> Cc: Yijing Wang <wangyijing@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
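A sketch of how a host bridge driver might build a per-bridge PCI/MSI domain with the enhanced core; pci_msi_create_irq_domain() and its device_node-based signature are assumed from this series, and the my_* names are placeholders:

#include <linux/msi.h>
#include <linux/irqdomain.h>

static struct msi_domain_info my_pci_msi_info;	/* chip/ops filled in elsewhere */

/* Stack a PCI/MSI domain on top of the parent interrupt controller domain
 * (e.g. a GIC ITS or the x86 vector domain).
 */
static struct irq_domain *my_bridge_create_msi_domain(struct device_node *node,
						       struct irq_domain *parent)
{
	return pci_msi_create_irq_domain(node, &my_pci_msi_info, parent);
}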
| * | | genirq: Provide default callbacks for msi_domain_opsJiang Liu2014-11-231-5/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Extend struct msi_domain_info and provide default callbacks for msi_domain_ops. Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-arm-kernel@lists.infradead.org Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Grant Likely <grant.likely@linaro.org> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Yijing Wang <wangyijing@huawei.com> Cc: Yingjoe Chen <yingjoe.chen@mediatek.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Matthias Brugger <matthias.bgg@gmail.com> Cc: Alexander Gordeev <agordeev@redhat.com> Link: http://lkml.kernel.org/r/1416061447-9472-8-git-send-email-jiang.liu@linux.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
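A sketch of an msi_domain_info that only overrides what it needs; the MSI_FLAG_USE_DEF_DOM_OPS/MSI_FLAG_USE_DEF_CHIP_OPS flag names are assumed from this series, and my_msi_irq_chip is hypothetical:

#include <linux/msi.h>

static struct irq_chip my_msi_irq_chip;		/* hypothetical MSI irq_chip */

static struct msi_domain_ops my_msi_ops = {
	/* only the callbacks this driver cares about are filled in;
	 * the flags below ask the core to supply defaults for the rest.
	 */
};

static struct msi_domain_info my_msi_info = {
	.flags	= MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS,
	.ops	= &my_msi_ops,
	.chip	= &my_msi_irq_chip,
};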
| * | | genirq: Introduce msi_domain_alloc/free_irqs()Jiang Liu2014-11-231-0/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce msi_domain_{alloc|free}_irqs() to alloc/free interrupts from generic MSI irqdomain. Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-arm-kernel@lists.infradead.org Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Grant Likely <grant.likely@linaro.org> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Yijing Wang <wangyijing@huawei.com> Cc: Yingjoe Chen <yingjoe.chen@mediatek.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Matthias Brugger <matthias.bgg@gmail.com> Cc: Alexander Gordeev <agordeev@redhat.com> Link: http://lkml.kernel.org/r/1416061447-9472-7-git-send-email-jiang.liu@linux.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
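A minimal sketch of a bus driver using the new entry points (the my_* names are illustrative):

#include <linux/device.h>
#include <linux/msi.h>
#include <linux/irqdomain.h>

static int my_bus_enable_msi(struct device *dev, struct irq_domain *msi_domain,
			     int nvec)
{
	int ret;

	/* allocate nvec interrupts for 'dev' from the generic MSI domain */
	ret = msi_domain_alloc_irqs(msi_domain, dev, nvec);
	if (ret)
		return ret;

	/* ... request_irq() on the allocated virqs, program the device ... */
	return 0;
}

static void my_bus_disable_msi(struct device *dev, struct irq_domain *msi_domain)
{
	msi_domain_free_irqs(msi_domain, dev);
}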
| * | | asm-generic: Add msi.hThomas Gleixner2014-11-231-0/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To support MSI irq domains we want a generic data structure for allocation, but we need the option to provide an architecture specific version of it. So instead of playing #ifdef games in linux/msi.h we add a generic header file and let architectures decide whether to include it or to provide their own implementation and provide the required typedef. I know that typedefs are not really nice, but in this case there are no forward declarations required and it's the simplest solution. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Arnd Bergmann <arnd@arndb.de> Cc: Jiang Liu <jiang.liu@linux.intel.com> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-arm-kernel@lists.infradead.org Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Grant Likely <grant.likely@linaro.org> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Yijing Wang <wangyijing@huawei.com> Cc: Yingjoe Chen <yingjoe.chen@mediatek.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Matthias Brugger <matthias.bgg@gmail.com> Cc: Alexander Gordeev <agordeev@redhat.com>
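A sketch of the two ways an architecture can satisfy the msi_alloc_info_t requirement; the Kbuild line and the arch header shown here are illustrative, not taken from a real architecture:

/* Option 1: take the generic definition (arch/<arch>/include/asm/Kbuild):
 *
 *	generic-y += msi.h
 *
 * Option 2: provide an arch-specific asm/msi.h with its own typedef,
 * e.g. when the arch already has a richer allocation cookie:
 */
#ifndef _ASM_MYARCH_MSI_H		/* hypothetical arch header */
#define _ASM_MYARCH_MSI_H

struct my_arch_irq_alloc_info;		/* arch-private allocation data */
typedef struct my_arch_irq_alloc_info msi_alloc_info_t;

#endif /* _ASM_MYARCH_MSI_H */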
| * | | genirq: Add generic msi irq domain supportJiang Liu2014-11-231-0/+45
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement the basic functions for MSI interrupt support with hierarchical interrupt domains. [ tglx: Extracted and combined from several patches ] Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Grant Likely <grant.likely@linaro.org> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Yingjoe Chen <yingjoe.chen@mediatek.com> Cc: Yijing Wang <wangyijing@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
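A minimal sketch of creating such a generic MSI domain on top of a parent domain; msi_create_irq_domain() taking a device_node in this era is assumed, and the wrapper name is hypothetical:

#include <linux/msi.h>
#include <linux/irqdomain.h>
#include <linux/printk.h>

static struct irq_domain *my_create_msi_domain(struct device_node *node,
					       struct msi_domain_info *info,
					       struct irq_domain *parent)
{
	struct irq_domain *d;

	/* the new domain is stacked on 'parent' (the underlying irqchip) */
	d = msi_create_irq_domain(node, info, parent);
	if (!d)
		pr_err("failed to create MSI irqdomain\n");

	return d;
}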
| * | | genirq: Introduce callback irq_chip.irq_write_msi_msgJiang Liu2014-11-231-0/+8
| | | | Introduce the callback irq_chip.irq_write_msi_msg so we can share common code among MSI-like interrupt controllers such as HPET and DMAR. Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Grant Likely <grant.likely@linaro.org> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Yingjoe Chen <yingjoe.chen@mediatek.com> Cc: Yijing Wang <wangyijing@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
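A sketch of an MSI-like irqchip (HPET/DMAR style) wiring up the new callback; the hardware write is a placeholder:

#include <linux/irq.h>
#include <linux/msi.h>

static void my_hw_write_msi_regs(struct msi_msg *msg)
{
	/* hypothetical: write msg->address_lo/hi and msg->data to the device */
}

static void my_irq_write_msi_msg(struct irq_data *data, struct msi_msg *msg)
{
	my_hw_write_msi_regs(msg);
}

static struct irq_chip my_msi_like_chip = {
	.name			= "my-MSI",
	.irq_write_msi_msg	= my_irq_write_msi_msg,
};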
| * | | irqdomain: Introduce helper function irq_domain_add_hierarchy()Jiang Liu2014-11-231-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce helper function irq_domain_add_hierarchy(), which creates a linear irqdomain if parameter 'size' is not zero, otherwise creates a tree irqdomain. Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-arm-kernel@lists.infradead.org Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Grant Likely <grant.likely@linaro.org> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Yijing Wang <wangyijing@huawei.com> Cc: Yingjoe Chen <yingjoe.chen@mediatek.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Matthias Brugger <matthias.bgg@gmail.com> Link: http://lkml.kernel.org/r/1416061447-9472-5-git-send-email-jiang.liu@linux.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
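A minimal sketch of a stacked irqchip driver using the helper (domain ops and sizes are illustrative):

#include <linux/irqdomain.h>

static int my_domain_alloc(struct irq_domain *d, unsigned int virq,
			   unsigned int nr_irqs, void *arg);
static void my_domain_free(struct irq_domain *d, unsigned int virq,
			   unsigned int nr_irqs);

static const struct irq_domain_ops my_domain_ops = {
	.alloc	= my_domain_alloc,
	.free	= my_domain_free,
};

static struct irq_domain *my_create_child_domain(struct irq_domain *parent,
						 struct device_node *node,
						 void *priv)
{
	/* size = 128 => linear revmap; passing size = 0 would create a tree domain */
	return irq_domain_add_hierarchy(parent, 0, 128, node,
					&my_domain_ops, priv);
}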
| * | | irqdomain: Implement a method to automatically call parent domains alloc/freeJiang Liu2014-11-231-15/+9
| | | | Add a flag to irq_domain.flags to control whether the irqdomain core should automatically call the parent irqdomain's alloc/free callbacks. This helps to reduce the code size of hierarchy irqdomain users. Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-arm-kernel@lists.infradead.org Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Grant Likely <grant.likely@linaro.org> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Yijing Wang <wangyijing@huawei.com> Cc: Yingjoe Chen <yingjoe.chen@mediatek.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Matthias Brugger <matthias.bgg@gmail.com> Link: http://lkml.kernel.org/r/1416061447-9472-4-git-send-email-jiang.liu@linux.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
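A sketch of opting into the automatic parent recursion; the flag name IRQ_DOMAIN_FLAG_AUTO_RECURSIVE is assumed from this series (without it, the child's .alloc callback would call irq_domain_alloc_irqs_parent() itself):

#include <linux/irqdomain.h>

static struct irq_domain *my_add_auto_recursive_domain(struct irq_domain *parent,
							struct device_node *node,
							const struct irq_domain_ops *ops,
							void *priv)
{
	/* the irqdomain core will call the parent domain's alloc/free for us */
	return irq_domain_add_hierarchy(parent, IRQ_DOMAIN_FLAG_AUTO_RECURSIVE,
					0, node, ops, priv);
}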
| * | | genirq: Introduce helper irq_domain_set_info() to reduce duplicated codeJiang Liu2014-11-231-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Grant Likely <grant.likely@linaro.org> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Yingjoe Chen <yingjoe.chen@mediatek.com> Cc: Yijing Wang <wangyijing@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
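A sketch of an irq_domain_ops.alloc callback using the helper to set hwirq, chip, chip data and flow handler in one call (the chip and the hwirq derivation are placeholders):

#include <linux/irq.h>
#include <linux/irqdomain.h>

static struct irq_chip my_child_chip;	/* hypothetical child irq_chip */

static int my_domain_alloc(struct irq_domain *domain, unsigned int virq,
			   unsigned int nr_irqs, void *arg)
{
	irq_hw_number_t hwirq = 0;	/* normally derived from 'arg' */
	unsigned int i;

	for (i = 0; i < nr_irqs; i++)
		irq_domain_set_info(domain, virq + i, hwirq + i,
				    &my_child_chip, NULL,
				    handle_simple_irq, NULL, NULL);

	return 0;
}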
| * | | genirq: Split out flow handler typedefs into seperate header fileThomas Gleixner2014-11-232-5/+15
| | | | | | | | | | | | | | | | | | | | | | | | Required to avoid circular include dependencies. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | genirq: Add IRQ_SET_MASK_OK_DONE to support stacked irqchipJiang Liu2014-11-231-0/+4
| | | | Add IRQ_SET_MASK_OK_DONE, in addition to IRQ_SET_MASK_OK and IRQ_SET_MASK_OK_NOCOPY, to support stacked irqchips. To the irq core, IRQ_SET_MASK_OK_DONE is the same as IRQ_SET_MASK_OK. To a stacked irqchip, it means that the ascendant irqchips have done all the work and no further handling is needed in the descendant irqchips. Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Grant Likely <grant.likely@linaro.org> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Yingjoe Chen <yingjoe.chen@mediatek.com> Cc: Yijing Wang <wangyijing@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
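A sketch of an irq_set_affinity callback in a parent irqchip returning the new value once it has fully handled the request:

#include <linux/irq.h>
#include <linux/cpumask.h>

static int my_parent_set_affinity(struct irq_data *d,
				  const struct cpumask *mask, bool force)
{
	/* ... program the hardware routing for 'mask' ... */

	/* tell descendant irqchips there is nothing left for them to do */
	return IRQ_SET_MASK_OK_DONE;
}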
| * | | genirq: Introduce irq_chip.irq_compose_msi_msg() to support stacked irqchipJiang Liu2014-11-231-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add callback irq_compose_msi_msg to struct irq_chip, which will be used to support stacked irqchip. Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Grant Likely <grant.likely@linaro.org> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Yingjoe Chen <yingjoe.chen@mediatek.com> Cc: Yijing Wang <wangyijing@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
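A sketch of a parent irqchip composing the MSI message for its children; the address and data values are made up for illustration:

#include <linux/irq.h>
#include <linux/msi.h>

static void my_parent_compose_msi_msg(struct irq_data *data,
				      struct msi_msg *msg)
{
	msg->address_hi	= 0;
	msg->address_lo	= 0xfee00000;		/* hypothetical doorbell address */
	msg->data	= (u32)irqd_to_hwirq(data);
}

static struct irq_chip my_parent_chip = {
	.name			= "my-parent",
	.irq_compose_msi_msg	= my_parent_compose_msi_msg,
};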