summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* thp: huge zero page: basic preparationKirill A. Shutemov2012-12-131-0/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During testing I noticed big (up to 2.5 times) memory consumption overhead on some workloads (e.g. ft.A from NPB) if THP is enabled. The main reason for that big difference is lacking zero page in THP case. We have to allocate a real page on read page fault. A program to demonstrate the issue: #include <assert.h> #include <stdlib.h> #include <unistd.h> #define MB 1024*1024 int main(int argc, char **argv) { char *p; int i; posix_memalign((void **)&p, 2 * MB, 200 * MB); for (i = 0; i < 200 * MB; i+= 4096) assert(p[i] == 0); pause(); return 0; } With thp-never RSS is about 400k, but with thp-always it's 200M. After the patcheset thp-always RSS is 400k too. Design overview. Huge zero page (hzp) is a non-movable huge page (2M on x86-64) filled with zeros. The way how we allocate it changes in the patchset: - [01/10] simplest way: hzp allocated on boot time in hugepage_init(); - [09/10] lazy allocation on first use; - [10/10] lockless refcounting + shrinker-reclaimable hzp; We setup it in do_huge_pmd_anonymous_page() if area around fault address is suitable for THP and we've got read page fault. If we fail to setup hzp (ENOMEM) we fallback to handle_pte_fault() as we normally do in THP. On wp fault to hzp we allocate real memory for the huge page and clear it. If ENOMEM, graceful fallback: we create a new pmd table and set pte around fault address to newly allocated normal (4k) page. All other ptes in the pmd set to normal zero page. We cannot split hzp (and it's bug if we try), but we can split the pmd which points to it. On splitting the pmd we create a table with all ptes set to normal zero page. === By hpa's request I've tried alternative approach for hzp implementation (see Virtual huge zero page patchset): pmd table with all entries set to zero page. This way should be more cache friendly, but it increases TLB pressure. The problem with virtual huge zero page: it requires per-arch enabling. We need a way to mark that pmd table has all ptes set to zero page. Some numbers to compare two implementations (on 4s Westmere-EX): Mirobenchmark1 ============== test: posix_memalign((void **)&p, 2 * MB, 8 * GB); for (i = 0; i < 100; i++) { assert(memcmp(p, p + 4*GB, 4*GB) == 0); asm volatile ("": : :"memory"); } hzp: Performance counter stats for './test_memcmp' (5 runs): 32356.272845 task-clock # 0.998 CPUs utilized ( +- 0.13% ) 40 context-switches # 0.001 K/sec ( +- 0.94% ) 0 CPU-migrations # 0.000 K/sec 4,218 page-faults # 0.130 K/sec ( +- 0.00% ) 76,712,481,765 cycles # 2.371 GHz ( +- 0.13% ) [83.31%] 36,279,577,636 stalled-cycles-frontend # 47.29% frontend cycles idle ( +- 0.28% ) [83.35%] 1,684,049,110 stalled-cycles-backend # 2.20% backend cycles idle ( +- 2.96% ) [66.67%] 134,355,715,816 instructions # 1.75 insns per cycle # 0.27 stalled cycles per insn ( +- 0.10% ) [83.35%] 13,526,169,702 branches # 418.039 M/sec ( +- 0.10% ) [83.31%] 1,058,230 branch-misses # 0.01% of all branches ( +- 0.91% ) [83.36%] 32.413866442 seconds time elapsed ( +- 0.13% ) vhzp: Performance counter stats for './test_memcmp' (5 runs): 30327.183829 task-clock # 0.998 CPUs utilized ( +- 0.13% ) 38 context-switches # 0.001 K/sec ( +- 1.53% ) 0 CPU-migrations # 0.000 K/sec 4,218 page-faults # 0.139 K/sec ( +- 0.01% ) 71,964,773,660 cycles # 2.373 GHz ( +- 0.13% ) [83.35%] 31,191,284,231 stalled-cycles-frontend # 43.34% frontend cycles idle ( +- 0.40% ) [83.32%] 773,484,474 stalled-cycles-backend # 1.07% backend cycles idle ( +- 6.61% ) [66.67%] 134,982,215,437 instructions # 1.88 insns per cycle # 0.23 stalled cycles per insn ( +- 0.11% ) [83.32%] 13,509,150,683 branches # 445.447 M/sec ( +- 0.11% ) [83.34%] 1,017,667 branch-misses # 0.01% of all branches ( +- 1.07% ) [83.32%] 30.381324695 seconds time elapsed ( +- 0.13% ) Mirobenchmark2 ============== test: posix_memalign((void **)&p, 2 * MB, 8 * GB); for (i = 0; i < 1000; i++) { char *_p = p; while (_p < p+4*GB) { assert(*_p == *(_p+4*GB)); _p += 4096; asm volatile ("": : :"memory"); } } hzp: Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs): 3505.727639 task-clock # 0.998 CPUs utilized ( +- 0.26% ) 9 context-switches # 0.003 K/sec ( +- 4.97% ) 4,384 page-faults # 0.001 M/sec ( +- 0.00% ) 8,318,482,466 cycles # 2.373 GHz ( +- 0.26% ) [33.31%] 5,134,318,786 stalled-cycles-frontend # 61.72% frontend cycles idle ( +- 0.42% ) [33.32%] 2,193,266,208 stalled-cycles-backend # 26.37% backend cycles idle ( +- 5.51% ) [33.33%] 9,494,670,537 instructions # 1.14 insns per cycle # 0.54 stalled cycles per insn ( +- 0.13% ) [41.68%] 2,108,522,738 branches # 601.451 M/sec ( +- 0.09% ) [41.68%] 158,746 branch-misses # 0.01% of all branches ( +- 1.60% ) [41.71%] 3,168,102,115 L1-dcache-loads # 903.693 M/sec ( +- 0.11% ) [41.70%] 1,048,710,998 L1-dcache-misses # 33.10% of all L1-dcache hits ( +- 0.11% ) [41.72%] 1,047,699,685 LLC-load # 298.854 M/sec ( +- 0.03% ) [33.38%] 2,287 LLC-misses # 0.00% of all LL-cache hits ( +- 8.27% ) [33.37%] 3,166,187,367 dTLB-loads # 903.147 M/sec ( +- 0.02% ) [33.35%] 4,266,538 dTLB-misses # 0.13% of all dTLB cache hits ( +- 0.03% ) [33.33%] 3.513339813 seconds time elapsed ( +- 0.26% ) vhzp: Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs): 27313.891128 task-clock # 0.998 CPUs utilized ( +- 0.24% ) 62 context-switches # 0.002 K/sec ( +- 0.61% ) 4,384 page-faults # 0.160 K/sec ( +- 0.01% ) 64,747,374,606 cycles # 2.370 GHz ( +- 0.24% ) [33.33%] 61,341,580,278 stalled-cycles-frontend # 94.74% frontend cycles idle ( +- 0.26% ) [33.33%] 56,702,237,511 stalled-cycles-backend # 87.57% backend cycles idle ( +- 0.07% ) [33.33%] 10,033,724,846 instructions # 0.15 insns per cycle # 6.11 stalled cycles per insn ( +- 0.09% ) [41.65%] 2,190,424,932 branches # 80.195 M/sec ( +- 0.12% ) [41.66%] 1,028,630 branch-misses # 0.05% of all branches ( +- 1.50% ) [41.66%] 3,302,006,540 L1-dcache-loads # 120.891 M/sec ( +- 0.11% ) [41.68%] 271,374,358 L1-dcache-misses # 8.22% of all L1-dcache hits ( +- 0.04% ) [41.66%] 20,385,476 LLC-load # 0.746 M/sec ( +- 1.64% ) [33.34%] 76,754 LLC-misses # 0.38% of all LL-cache hits ( +- 2.35% ) [33.34%] 3,309,927,290 dTLB-loads # 121.181 M/sec ( +- 0.03% ) [33.34%] 2,098,967,427 dTLB-misses # 63.41% of all dTLB cache hits ( +- 0.03% ) [33.34%] 27.364448741 seconds time elapsed ( +- 0.24% ) === I personally prefer implementation present in this patchset. It doesn't touch arch-specific code. This patch: Huge zero page (hzp) is a non-movable huge page (2M on x86-64) filled with zeros. For now let's allocate the page on hugepage_init(). We'll switch to lazy allocation later. We are not going to map the huge zero page until we can handle it properly on all code paths. is_huge_zero_{pfn,pmd}() functions will be used by following patches to check whether the pfn/pmd is huge zero page. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: "H. Peter Anvin" <hpa@linux.intel.com> Cc: Mel Gorman <mel@csn.ul.ie> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* bootmem: remove alloc_arch_preferred_bootmem()Joonsoo Kim2012-12-131-16/+4
| | | | | | | | | | | | | | | | | The name of this function is not suitable, and removing the function and open-coding it into each call sites makes the code more understandable. Additionally, we shouldn't do an allocation from bootmem when slab_is_available(), so directly return kmalloc()'s return value. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Cc: Haavard Skinnemoen <hskinnemoen@gmail.com> Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* bootmem: remove not implemented function call, bootmem_arch_preferred_node()Joonsoo Kim2012-12-131-12/+0
| | | | | | | | | | | | | | There is no implementation of bootmem_arch_preferred_node() and a call to this function will cause a compilation error. So remove it. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Cc: Haavard Skinnemoen <hskinnemoen@gmail.com> Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'for-linus' of ↵Linus Torvalds2012-12-12273-4221/+1375
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal Pull big execve/kernel_thread/fork unification series from Al Viro: "All architectures are converted to new model. Quite a bit of that stuff is actually shared with architecture trees; in such cases it's literally shared branch pulled by both, not a cherry-pick. A lot of ugliness and black magic is gone (-3KLoC total in this one): - kernel_thread()/kernel_execve()/sys_execve() redesign. We don't do syscalls from kernel anymore for either kernel_thread() or kernel_execve(): kernel_thread() is essentially clone(2) with callback run before we return to userland, the callbacks either never return or do successful do_execve() before returning. kernel_execve() is a wrapper for do_execve() - it doesn't need to do transition to user mode anymore. As a result kernel_thread() and kernel_execve() are arch-independent now - they live in kernel/fork.c and fs/exec.c resp. sys_execve() is also in fs/exec.c and it's completely architecture-independent. - daemonize() is gone, along with its parts in fs/*.c - struct pt_regs * is no longer passed to do_fork/copy_process/ copy_thread/do_execve/search_binary_handler/->load_binary/do_coredump. - sys_fork()/sys_vfork()/sys_clone() unified; some architectures still need wrappers (ones with callee-saved registers not saved in pt_regs on syscall entry), but the main part of those suckers is in kernel/fork.c now." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: (113 commits) do_coredump(): get rid of pt_regs argument print_fatal_signal(): get rid of pt_regs argument ptrace_signal(): get rid of unused arguments get rid of ptrace_signal_deliver() arguments new helper: signal_pt_regs() unify default ptrace_signal_deliver flagday: kill pt_regs argument of do_fork() death to idle_regs() don't pass regs to copy_process() flagday: don't pass regs to copy_thread() bfin: switch to generic vfork, get rid of pointless wrappers xtensa: switch to generic clone() openrisc: switch to use of generic fork and clone unicore32: switch to generic clone(2) score: switch to generic fork/vfork/clone c6x: sanitize copy_thread(), get rid of clone(2) wrapper, switch to generic clone() take sys_fork/sys_vfork/sys_clone prototypes to linux/syscalls.h mn10300: switch to generic fork/vfork/clone h8300: switch to generic fork/vfork/clone tile: switch to generic clone() ... Conflicts: arch/microblaze/include/asm/Kbuild
| * do_coredump(): get rid of pt_regs argumentAl Viro2012-11-293-5/+5
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * print_fatal_signal(): get rid of pt_regs argumentAl Viro2012-11-291-2/+3
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * ptrace_signal(): get rid of unused argumentsAl Viro2012-11-291-4/+2
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * get rid of ptrace_signal_deliver() argumentsAl Viro2012-11-294-5/+5
| | | | | | | | | | | | | | the first one is equal to signal_pt_regs(), the second is never used (and always NULL, while we are at it). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * new helper: signal_pt_regs()Al Viro2012-11-293-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Always equal to task_pt_regs(current); defined only when we are in signal delivery. It may be different from current_pt_regs() - e.g. architectures like m68k may have pt_regs location on exception different from that on a syscall and signals (just as ptrace handling) may happen on exceptions as well as on syscalls. When they are equal, it's often better to have signal_pt_regs defined (in asm/ptrace.h) as current_pt_regs - that tends to be optimized better than default would be. However, optimisation is the only reason why we might want an arch-specific definition; if current_pt_regs() and task_pt_regs(current) have different values, the latter one is right. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * unify default ptrace_signal_deliverAl Viro2012-11-2918-42/+6
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * flagday: kill pt_regs argument of do_fork()Al Viro2012-11-299-26/+19
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * death to idle_regs()Al Viro2012-11-294-22/+0
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * don't pass regs to copy_process()Al Viro2012-11-291-4/+2
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * flagday: don't pass regs to copy_thread()Al Viro2012-11-2934-60/+45
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * bfin: switch to generic vfork, get rid of pointless wrappersAl Viro2012-11-295-54/+7
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * xtensa: switch to generic clone()Al Viro2012-11-295-11/+3
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * openrisc: switch to use of generic fork and cloneAl Viro2012-11-296-66/+19
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * unicore32: switch to generic clone(2)Al Viro2012-11-294-25/+7
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * score: switch to generic fork/vfork/cloneAl Viro2012-11-296-58/+8
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * c6x: sanitize copy_thread(), get rid of clone(2) wrapper, switch to generic ↵Al Viro2012-11-294-36/+6
| | | | | | | | | | | | clone() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * take sys_fork/sys_vfork/sys_clone prototypes to linux/syscalls.hAl Viro2012-11-2912-63/+11
| | | | | | | | | | | | now it can be done... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * mn10300: switch to generic fork/vfork/cloneAl Viro2012-11-292-28/+8
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * h8300: switch to generic fork/vfork/cloneAl Viro2012-11-293-44/+6
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * tile: switch to generic clone()Al Viro2012-11-292-8/+1
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * m68k: sanitize copy_thread(), fork/vfork/clone wrappers, switch to generic ↵Al Viro2012-11-294-71/+46
| | | | | | | | | | | | fork/vfork Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * alpha: switch to generic fork/vfork/cloneAl Viro2012-11-294-59/+17
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * s390: switch to generic fork/vfork/cloneAl Viro2012-11-293-42/+14
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * microblaze: switch to generic fork/vfork/cloneAl Viro2012-11-297-56/+12
| | | | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Michal Simek <michal.simek@xilinx.com>
| * hexagon: switch to generic clone()Al Viro2012-11-294-52/+6
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * sh: switch to generic fork/vfork/cloneAl Viro2012-11-295-110/+11
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * parisc: switch to generic fork/vfork/cloneAl Viro2012-11-294-69/+22
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * m32r: switch to generic fork/vfork/cloneAl Viro2012-11-292-42/+7
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * cris: switch to generic fork/vfork/cloneAl Viro2012-11-294-60/+13
| | | | | | | | | | | | | | same braindamage as on s390 - the first two arguments of clone(2) in the wrong order. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * avr32: sanitize copy_thread(), switch to generic fork/vfork/clone, kill wrappersAl Viro2012-11-294-47/+11
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * arm64: sanitize copy_thread(), switch to generic fork/vfork/cloneAl Viro2012-11-297-28/+12
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * frv: switch to generic fork/vfork/cloneAl Viro2012-11-292-40/+8
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * powerpc: switch to generic fork/clone/vforkAl Viro2012-11-294-32/+4
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * arm: switch to generic fork/vfork/cloneAl Viro2012-11-296-55/+13
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * x86, um: switch to generic fork/vfork/cloneAl Viro2012-11-2916-117/+52
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * get rid of pt_regs argument of ->load_binary()Al Viro2012-11-2912-20/+25
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * get rid of pt_regs argument of search_binary_handler()Al Viro2012-11-296-9/+8
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * get rid of pt_regs argument of do_execve_common()Al Viro2012-11-291-4/+4
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * get rid of pt_regs argument of do_execve()Al Viro2012-11-293-13/+8
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * make compat_do_execve() static, lose pt_regs argumentAl Viro2012-11-292-8/+4
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * kill stray kernel_thread() garbageAl Viro2012-11-294-6/+0
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * audit: no nested contexts anymore...Al Viro2012-11-291-81/+21
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * consolidate sys_execve() prototypeAl Viro2012-11-296-18/+3
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| *-----------------. Merge branches 'no-rebases', 'arch-avr32', 'arch-blackfin', 'arch-cris', ↵Al Viro2012-11-29165-2781/+982
| |\ \ \ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 'arch-h8300', 'arch-m32r', 'arch-mn10300', 'arch-score', 'arch-sh' and 'arch-powerpc' into for-next
| | | | | | | | | | | * powerpc: make fork_idle() take the common "kernel thread" path in copy_thread()Al Viro2012-10-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| | | | | | | | | | | * powerpc: put the "zero usp means using parent's stack pointer" to copy_thread()Al Viro2012-10-221-5/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | simplifies callers, at that... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>