summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* printk: tweak do_syslog() to match commentsAlex Elder2014-08-071-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | In do_syslog() there's a path used by kmsg_poll() and kmsg_read() that only needs to know whether there's any data available to read (and not its size). These callers only check for non-zero return. As a shortcut, do_syslog() returns the difference between what has been logged and what has been "seen." The comments say that the "count of records" should be returned but it's not. Instead it returns (log_next_idx - syslog_idx), which is a difference between buffer offsets--and the result could be negative. The behavior is the same (it'll be zero or not in the same cases), but the count of records is more meaningful and it matches what the comments say. So change the code to return that. Signed-off-by: Alex Elder <elder@linaro.org> Cc: Petr Mladek <pmladek@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Joe Perches <joe@perches.com> Cc: John Stultz <john.stultz@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* printk: allow increasing the ring buffer depending on the number of CPUsLuis R. Rodriguez2014-08-073-6/+82
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The default size of the ring buffer is too small for machines with a large amount of CPUs under heavy load. What ends up happening when debugging is the ring buffer overlaps and chews up old messages making debugging impossible unless the size is passed as a kernel parameter. An idle system upon boot up will on average spew out only about one or two extra lines but where this really matters is on heavy load and that will vary widely depending on the system and environment. There are mechanisms to help increase the kernel ring buffer for tracing through debugfs, and those interfaces even allow growing the kernel ring buffer per CPU. We also have a static value which can be passed upon boot. Relying on debugfs however is not ideal for production, and relying on the value passed upon bootup is can only used *after* an issue has creeped up. Instead of being reactive this adds a proactive measure which lets you scale the amount of contributions you'd expect to the kernel ring buffer under load by each CPU in the worst case scenario. We use num_possible_cpus() to avoid complexities which could be introduced by dynamically changing the ring buffer size at run time, num_possible_cpus() lets us use the upper limit on possible number of CPUs therefore avoiding having to deal with hotplugging CPUs on and off. This introduces the kernel configuration option LOG_CPU_MAX_BUF_SHIFT which is used to specify the maximum amount of contributions to the kernel ring buffer in the worst case before the kernel ring buffer flips over, the size is specified as a power of 2. The total amount of contributions made by each CPU must be greater than half of the default kernel ring buffer size (1 << LOG_BUF_SHIFT bytes) in order to trigger an increase upon bootup. The kernel ring buffer is increased to the next power of two that would fit the required minimum kernel ring buffer size plus the additional CPU contribution. For example if LOG_BUF_SHIFT is 18 (256 KB) you'd require at least 128 KB contributions by other CPUs in order to trigger an increase of the kernel ring buffer. With a LOG_CPU_BUF_SHIFT of 12 (4 KB) you'd require at least anything over > 64 possible CPUs to trigger an increase. If you had 128 possible CPUs the amount of minimum required kernel ring buffer bumps to: ((1 << 18) + ((128 - 1) * (1 << 12))) / 1024 = 764 KB Since we require the ring buffer to be a power of two the new required size would be 1024 KB. This CPU contributions are ignored when the "log_buf_len" kernel parameter is used as it forces the exact size of the ring buffer to an expected power of two value. [pmladek@suse.cz: fix build] Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.cz> Tested-by: Davidlohr Bueso <davidlohr@hp.com> Tested-by: Petr Mladek <pmladek@suse.cz> Reviewed-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Andrew Lunn <andrew@lunn.ch> Cc: Stephen Warren <swarren@wwwdotorg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Petr Mladek <pmladek@suse.cz> Cc: Joe Perches <joe@perches.com> Cc: Arun KS <arunks.linux@gmail.com> Cc: Kees Cook <keescook@chromium.org> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* printk: make dynamic units clear for the kernel ring bufferLuis R. Rodriguez2014-08-071-1/+1
| | | | | | | | | | | | | | | | | | Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com> Suggested-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Andrew Lunn <andrew@lunn.ch> Cc: Stephen Warren <swarren@wwwdotorg.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Petr Mladek <pmladek@suse.cz> Cc: Joe Perches <joe@perches.com> Cc: Arun KS <arunks.linux@gmail.com> Cc: Kees Cook <keescook@chromium.org> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* printk: move power of 2 practice of ring buffer size to a helperLuis R. Rodriguez2014-08-071-4/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In practice the power of 2 practice of the size of the kernel ring buffer remains purely historical but not a requirement, specially now that we have LOG_ALIGN and use it for both static and dynamic allocations. It could have helped with implicit alignment back in the days given the even the dynamically sized ring buffer was guaranteed to be aligned so long as CONFIG_LOG_BUF_SHIFT was set to produce a __LOG_BUF_LEN which is architecture aligned, since log_buf_len=n would be allowed only if it was > __LOG_BUF_LEN and we always ended up rounding the log_buf_len=n to the next power of 2 with roundup_pow_of_two(), any multiple of 2 then should be also architecture aligned. These assumptions of course relied heavily on CONFIG_LOG_BUF_SHIFT producing an aligned value but users can always change this. We now have precise alignment requirements set for the log buffer size for both static and dynamic allocations, but lets upkeep the old practice of using powers of 2 for its size to help with easy expected scalable values and the allocators for dynamic allocations. We'll reuse this later so move this into a helper. Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com> Cc: Andrew Lunn <andrew@lunn.ch> Cc: Stephen Warren <swarren@wwwdotorg.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Petr Mladek <pmladek@suse.cz> Cc: Joe Perches <joe@perches.com> Cc: Arun KS <arunks.linux@gmail.com> Cc: Kees Cook <keescook@chromium.org> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* printk: make dynamic kernel ring buffer alignment explicitLuis R. Rodriguez2014-08-071-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We have to consider alignment for the ring buffer both for the default static size, and then also for when an dynamic allocation is made when the log_buf_len=n kernel parameter is passed to set the size specifically to a size larger than the default size set by the architecture through CONFIG_LOG_BUF_SHIFT. The default static kernel ring buffer can be aligned properly if architectures set CONFIG_LOG_BUF_SHIFT properly, we provide ranges for the size though so even if CONFIG_LOG_BUF_SHIFT has a sensible aligned value it can be reduced to a non aligned value. Commit 6ebb017de9 ("printk: Fix alignment of buf causing crash on ARM EABI") by Andrew Lunn ensures the static buffer is always aligned and the decision of alignment is done by the compiler by using __alignof__(struct log). When log_buf_len=n is used we allocate the ring buffer dynamically. Dynamic allocation varies, for the early allocation called before setup_arch() memblock_virt_alloc() requests a page aligment and for the default kernel allocation memblock_virt_alloc_nopanic() requests no special alignment, which in turn ends up aligning the allocation to SMP_CACHE_BYTES, which is L1 cache aligned. Since we already have the required alignment for the kernel ring buffer though we can do better and request explicit alignment for LOG_ALIGN. This does that to be safe and make dynamic allocation alignment explicit. Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com> Tested-by: Petr Mladek <pmladek@suse.cz> Acked-by: Petr Mladek <pmladek@suse.cz> Cc: Andrew Lunn <andrew@lunn.ch> Cc: Stephen Warren <swarren@wwwdotorg.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Petr Mladek <pmladek@suse.cz> Cc: Joe Perches <joe@perches.com> Cc: Arun KS <arunks.linux@gmail.com> Cc: Kees Cook <keescook@chromium.org> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* include/linux/byteorder/generic.h: minor comment fixGeoff Levand2014-08-071-1/+1
| | | | | | Signed-off-by: Geoff Levand <geoff@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* fs.h, drivers/hwmon/asus_atk0110.c: fix DEFINE_SIMPLE_ATTRIBUTE semicolon ↵Joe Perches2014-08-072-2/+2
| | | | | | | | | | | | | definition and use The DEFINE_SIMPLE_ATTRIBUTE macro should not end in a ; Fix the one use in the kernel tree that did not have a semicolon. Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Guenter Roeck <linux@roeck-us.net> Acked-by: Luca Tettamanti <kronos.it@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ./Makefile: tell gcc optimizer to never introduce new data racesJiri Kosina2014-08-071-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We have been chasing a memory corruption bug, which turned out to be caused by very old gcc (4.3.4), which happily turned conditional load into a non-conditional one, and that broke correctness (the condition was met only if lock was held) and corrupted memory. This particular problem with that particular code did not happen when never gccs were used. I've brought this up with our gcc folks, as I wanted to make sure that this can't really happen again, and it turns out it actually can. Quoting Martin Jambor <mjambor@suse.cz>: "More current GCCs are more careful when it comes to replacing a conditional load with a non-conditional one, most notably they check that a store happens in each iteration of _a_ loop but they assume loops are executed. They also perform a simple check whether the store cannot trap which currently passes only for non-const variables. A simple testcase demonstrating it on an x86_64 is for example the following: $ cat cond_store.c int g_1 = 1; int g_2[1024] __attribute__((section ("safe_section"), aligned (4096))); int c = 4; int __attribute__ ((noinline)) foo (void) { int l; for (l = 0; (l != 4); l++) { if (g_1) return l; for (g_2[0] = 0; (g_2[0] >= 26); ++g_2[0]) ; } return 2; } int main (int argc, char* argv[]) { if (mprotect (g_2, sizeof(g_2), PROT_READ) == -1) { int e = errno; error (e, e, "mprotect error %i", e); } foo (); __builtin_printf("OK\n"); return 0; } /* EOF */ $ ~/gcc/trunk/inst/bin/gcc cond_store.c -O2 --param allow-store-data-races=0 $ ./a.out OK $ ~/gcc/trunk/inst/bin/gcc cond_store.c -O2 --param allow-store-data-races=1 $ ./a.out Segmentation fault The testcase fails the same at least with 4.9, 4.8 and 4.7. Therefore I would suggest building kernels with this parameter set to zero. I also agree with Jikos that the default should be changed for -O2. I have run most of the SPEC 2k6 CPU benchmarks (gamess and dealII failed, at -O2, not sure why) compiled with and without this option and did not see any real difference between respective run-times" Hopefully the default will be changed in newer gccs, but let's force it for kernel builds so that we are on a safe side even when older gcc are used. The code in question was out-of-tree printk-in-NMI (yeah, surprise suprise, once again) patch written by Petr Mladek, let me quote his comment from our internal bugzilla: "I have spent few days investigating inconsistent state of kernel ring buffer. It went out that it was caused by speculative store generated by gcc-4.3.4. The problem is in assembly generated for make_free_space(). The functions is called the following way: + vprintk_emit(); + log = MAIN_LOG; // with logbuf_lock or log = NMI_LOG; // with nmi_logbuf_lock cont_add(log, ...); + cont_flush(log, ...); + log_store(log, ...); + log_make_free_space(log, ...); If called with log = NMI_LOG then only nmi_log_* global variables are safe to modify but the generated code does store also into (main_)log_* global variables: <log_make_free_space>: 55 push %rbp 89 f6 mov %esi,%esi 48 8b 05 03 99 51 01 mov 0x1519903(%rip),%rax # ffffffff82620868 <nmi_log_next_id> 44 8b 1d ec 98 51 01 mov 0x15198ec(%rip),%r11d # ffffffff82620858 <log_next_idx> 8b 35 36 60 14 01 mov 0x1146036(%rip),%esi # ffffffff8224cfa8 <log_buf_len> 44 8b 35 33 60 14 01 mov 0x1146033(%rip),%r14d # ffffffff8224cfac <nmi_log_buf_len> 4c 8b 2d d0 98 51 01 mov 0x15198d0(%rip),%r13 # ffffffff82620850 <log_next_seq> 4c 8b 25 11 61 14 01 mov 0x1146111(%rip),%r12 # ffffffff8224d098 <log_buf> 49 89 c2 mov %rax,%r10 48 21 c2 and %rax,%rdx 48 8b 1d 0c 99 55 01 mov 0x155990c(%rip),%rbx # ffffffff826608a0 <nmi_log_buf> 49 c1 ea 20 shr $0x20,%r10 48 89 55 d0 mov %rdx,-0x30(%rbp) 44 29 de sub %r11d,%esi 45 29 d6 sub %r10d,%r14d 4c 8b 0d 97 98 51 01 mov 0x1519897(%rip),%r9 # ffffffff82620840 <log_first_seq> eb 7e jmp ffffffff81107029 <log_make_free_space+0xe9> [...] 85 ff test %edi,%edi # edi = 1 for NMI_LOG 4c 89 e8 mov %r13,%rax 4c 89 ca mov %r9,%rdx 74 0a je ffffffff8110703d <log_make_free_space+0xfd> 8b 15 27 98 51 01 mov 0x1519827(%rip),%edx # ffffffff82620860 <nmi_log_first_id> 48 8b 45 d0 mov -0x30(%rbp),%rax 48 39 c2 cmp %rax,%rdx # end of loop 0f 84 da 00 00 00 je ffffffff81107120 <log_make_free_space+0x1e0> [...] 85 ff test %edi,%edi # edi = 1 for NMI_LOG 4c 89 0d 17 97 51 01 mov %r9,0x1519717(%rip) # ffffffff82620840 <log_first_seq> ^^^^^^^^^^^^^^^^^^^^^^^^^^ KABOOOM 74 35 je ffffffff81107160 <log_make_free_space+0x220> It stores log_first_seq when edi == NMI_LOG. This instructions are used also when edi == MAIN_LOG but the store is done speculatively before the condition is decided. It is unsafe because we do not have "logbuf_lock" in NMI context and some other process migh modify "log_first_seq" in parallel" I believe that the best course of action is both - building kernel (and anything multi-threaded, I guess) with that optimization turned off - persuade gcc folks to change the default for future releases Signed-off-by: Jiri Kosina <jkosina@suse.cz> Cc: Martin Jambor <mjambor@suse.cz> Cc: Petr Mladek <pmladek@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Marek Polacek <polacek@redhat.com> Cc: Jakub Jelinek <jakub@redhat.com> Cc: Steven Noonan <steven@uplinklabs.net> Cc: Richard Biener <richard.guenther@gmail.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/zpool: update zswap to use zpoolDan Streetman2014-08-072-31/+46
| | | | | | | | | | | | | | Change zswap to use the zpool api instead of directly using zbud. Add a boot-time param to allow selecting which zpool implementation to use, with zbud as the default. Signed-off-by: Dan Streetman <ddstreet@ieee.org> Tested-by: Seth Jennings <sjennings@variantweb.net> Cc: Weijie Yang <weijie.yang@samsung.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/zpool: zbud/zsmalloc implement zpoolDan Streetman2014-08-072-0/+179
| | | | | | | | | | | | | Update zbud and zsmalloc to implement the zpool api. [fengguang.wu@intel.com: make functions static] Signed-off-by: Dan Streetman <ddstreet@ieee.org> Tested-by: Seth Jennings <sjennings@variantweb.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Weijie Yang <weijie.yang@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/zpool: implement common zpool api to zbud/zsmallocDan Streetman2014-08-075-18/+495
| | | | | | | | | | | | | | | | | | Add zpool api. zpool provides an interface for memory storage, typically of compressed memory. Users can select what backend to use; currently the only implementations are zbud, a low density implementation with up to two compressed pages per storage page, and zsmalloc, a higher density implementation with multiple compressed pages per storage page. Signed-off-by: Dan Streetman <ddstreet@ieee.org> Tested-by: Seth Jennings <sjennings@variantweb.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Weijie Yang <weijie.yang@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/zbud: change zbud_alloc size type to size_tDan Streetman2014-08-072-3/+3
| | | | | | | | | | | | | | | | | | | Change the type of the zbud_alloc() size param from unsigned int to size_t. Technically, this should not make any difference, as the zbud implementation already restricts the size to well within either type's limits; but as zsmalloc (and kmalloc) use size_t, and zpool will use size_t, this brings the size parameter type in line with zsmalloc/zpool. Signed-off-by: Dan Streetman <ddstreet@ieee.org> Acked-by: Seth Jennings <sjennings@variantweb.net> Tested-by: Seth Jennings <sjennings@variantweb.net> Cc: Weijie Yang <weijie.yang@samsung.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* zram: replace global tb_lock with fine grain lockWeijie Yang2014-08-072-33/+60
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, we use a rwlock tb_lock to protect concurrent access to the whole zram meta table. However, according to the actual access model, there is only a small chance for upper user to access the same table[index], so the current lock granularity is too big. The idea of optimization is to change the lock granularity from whole meta table to per table entry (table -> table[index]), so that we can protect concurrent access to the same table[index], meanwhile allow the maximum concurrency. With this in mind, several kinds of locks which could be used as a per-entry lock were tested and compared: Test environment: x86-64 Intel Core2 Q8400, system memory 4GB, Ubuntu 12.04, kernel v3.15.0-rc3 as base, zram with 4 max_comp_streams LZO. iozone test: iozone -t 4 -R -r 16K -s 200M -I +Z (1GB zram with ext4 filesystem, take the average of 10 tests, KB/s) Test base CAS spinlock rwlock bit_spinlock ------------------------------------------------------------------- Initial write 1381094 1425435 1422860 1423075 1421521 Rewrite 1529479 1641199 1668762 1672855 1654910 Read 8468009 11324979 11305569 11117273 10997202 Re-read 8467476 11260914 11248059 11145336 10906486 Reverse Read 6821393 8106334 8282174 8279195 8109186 Stride read 7191093 8994306 9153982 8961224 9004434 Random read 7156353 8957932 9167098 8980465 8940476 Mixed workload 4172747 5680814 5927825 5489578 5972253 Random write 1483044 1605588 1594329 1600453 1596010 Pwrite 1276644 1303108 1311612 1314228 1300960 Pread 4324337 4632869 4618386 4457870 4500166 To enhance the possibility of access the same table[index] concurrently, set zram a small disksize(10MB) and let threads run with large loop count. fio test: fio --bs=32k --randrepeat=1 --randseed=100 --refill_buffers --scramble_buffers=1 --direct=1 --loops=3000 --numjobs=4 --filename=/dev/zram0 --name=seq-write --rw=write --stonewall --name=seq-read --rw=read --stonewall --name=seq-readwrite --rw=rw --stonewall --name=rand-readwrite --rw=randrw --stonewall (10MB zram raw block device, take the average of 10 tests, KB/s) Test base CAS spinlock rwlock bit_spinlock ------------------------------------------------------------- seq-write 933789 999357 1003298 995961 1001958 seq-read 5634130 6577930 6380861 6243912 6230006 seq-rw 1405687 1638117 1640256 1633903 1634459 rand-rw 1386119 1614664 1617211 1609267 1612471 All the optimization methods show a higher performance than the base, however, it is hard to say which method is the most appropriate. On the other hand, zram is mostly used on small embedded system, so we don't want to increase any memory footprint. This patch pick the bit_spinlock method, pack object size and page_flag into an unsigned long table.value, so as to not increase any memory overhead on both 32-bit and 64-bit system. On the third hand, even though different kinds of locks have different performances, we can ignore this difference, because: if zram is used as zram swapfile, the swap subsystem can prevent concurrent access to the same swapslot; if zram is used as zram-blk for set up filesystem on it, the upper filesystem and the page cache also prevent concurrent access of the same block mostly. So we can ignore the different performances among locks. Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Reviewed-by: Davidlohr Bueso <davidlohr@hp.com> Signed-off-by: Weijie Yang <weijie.yang@samsung.com> Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* zram: use size_t instead of u16Minchan Kim2014-08-071-1/+1
| | | | | | | | | | | | | | Some architectures (eg, hexagon and PowerPC) could use PAGE_SHIFT of 16 or more. In these cases u16 is not sufficiently large to represent a compressed page's size so use size_t. Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: Weijie Yang <weijie.yang@samsung.com> Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* zram: remove unused SECTOR_SIZE defineSergey Senozhatsky2014-08-071-1/+0
| | | | | | | | | | | Drop SECTOR_SIZE define, because it's not used. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Weijie Yang <weijie.yang@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* zram: rename struct `table' to `zram_table_entry'Sergey Senozhatsky2014-08-071-2/+2
| | | | | | | | | | | | Andrew Morton has recently noted that `struct table' actually represents table entry and, thus, should be renamed. Rename to `zram_table_entry'. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Weijie Yang <weijie.yang@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/highmem: make kmap cache coloring awareMax Filippov2014-08-071-11/+75
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | User-visible effect: Architectures that choose this method of maintaining cache coherency (MIPS and xtensa currently) are able to use high memory on cores with aliasing data cache. Without this fix such architectures can not use high memory (in case of xtensa it means that at most 128 MBytes of physical memory is available). The problem: VIPT cache with way size larger than MMU page size may suffer from aliasing problem: a single physical address accessed via different virtual addresses may end up in multiple locations in the cache. Virtual mappings of a physical address that always get cached in different cache locations are said to have different colors. L1 caching hardware usually doesn't handle this situation leaving it up to software. Software must avoid this situation as it leads to data corruption. What can be done: One way to handle this is to flush and invalidate data cache every time page mapping changes color. The other way is to always map physical page at a virtual address with the same color. Low memory pages already have this property. Giving architecture a way to control color of high memory page mapping allows reusing of existing low memory cache alias handling code. How this is done with this patch: Provide hooks that allow architectures with aliasing cache to align mapping address of high pages according to their color. Such architectures may enforce similar coloring of low- and high-memory page mappings and reuse existing cache management functions to support highmem. This code is based on the implementation of similar feature for MIPS by Leonid Yegoshin. Signed-off-by: Max Filippov <jcmvbkbc@gmail.com> Cc: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com> Cc: Chris Zankel <chris@zankel.net> Cc: Marc Gauthier <marc@cadence.com> Cc: David Rientjes <rientjes@google.com> Cc: Steven Hill <Steven.Hill@imgtec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mmu_notifier: add call_srcu and sync function for listener to delay call and ↵Peter Zijlstra2014-08-072-1/+45
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | sync When kernel device drivers or subsystems want to bind their lifespan to t= he lifespan of the mm_struct, they usually use one of the following methods: 1. Manually calling a function in the interested kernel module. The funct= ion call needs to be placed in mmput. This method was rejected by several ker= nel maintainers. 2. Registering to the mmu notifier release mechanism. The problem with the latter approach is that the mmu_notifier_release cal= lback is called from__mmu_notifier_release (called from exit_mmap). That functi= on iterates over the list of mmu notifiers and don't expect the release call= back function to remove itself from the list. Therefore, the callback function= in the kernel module can't release the mmu_notifier_object, which is actuall= y the kernel module's object itself. As a result, the destruction of the kernel module's object must to be done in a delayed fashion. This patch adds support for this delayed callback, by adding a new mmu_notifier_call_srcu function that receives a function ptr and calls th= at function with call_srcu. In that function, the kernel module releases its object. To use mmu_notifier_call_srcu, the calling module needs to call b= efore that a new function called mmu_notifier_unregister_no_release that as its= name implies, unregisters a notifier without calling its notifier release call= back. This patch also adds a function that will call barrier_srcu so those kern= el modules can sync with mmu_notifier. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Oded Gabbay <oded.gabbay@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: BUG when __kmap_atomic_idx equals KM_TYPE_NRChintan Pandya2014-08-071-1/+1
| | | | | | | | | | | | | | | __kmap_atomic_idx is per_cpu variable. Each CPU can use KM_TYPE_NR entries from FIXMAP i.e. from 0 to KM_TYPE_NR - 1. Allowing __kmap_atomic_idx to over- shoot to KM_TYPE_NR can mess up with next CPU's 0th entry which is a bug. Hence BUG_ON if __kmap_atomic_idx >= KM_TYPE_NR. Fix the off-by-on in this test. Signed-off-by: Chintan Pandya <cpandya@codeaurora.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: memcontrol: clean up reclaim size variable use in try_charge()Johannes Weiner2014-08-071-3/+3
| | | | | | | | | | | | | | Charge reclaim and OOM currently use the charge batch variable, but batching is already disabled at that point. To simplify the charge logic, the batch variable is reset to the original request size when reclaim is entered, so it's functionally equal, but it's misleading. Switch reclaim/OOM to nr_pages, which is the original request size. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* kernel/smp.c:on_each_cpu_cond(): fix warning in fallback pathSasha Levin2014-08-071-1/+1
| | | | | | | | | | | | | | | | The rarely-executed memry-allocation-failed callback path generates a WARN_ON_ONCE() when smp_call_function_single() succeeds. Presumably it's supposed to warn on failures. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: Gilad Ben-Yossef <gilad@benyossef.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Tejun Heo <htejun@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: change confusing #ifdef use in __access_remote_vmRik van Riel2014-08-071-2/+4
| | | | | | | | | | | | This patch changes confusing #ifdef use in __access_remote_vm into merely ugly #ifdef use. Addresses bug https://bugzilla.kernel.org/show_bug.cgi?id=81651 Signed-off-by: Rik van Riel <riel@redhat.com> Reported-by: David Binderman <dcb314@hotmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: softdirty: respect VM_SOFTDIRTY in PTE holesPeter Feiner2014-08-071-6/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | After a VMA is created with the VM_SOFTDIRTY flag set, /proc/pid/pagemap should report that the VMA's virtual pages are soft-dirty until VM_SOFTDIRTY is cleared (i.e., by the next write of "4" to /proc/pid/clear_refs). However, pagemap ignores the VM_SOFTDIRTY flag for virtual addresses that fall in PTE holes (i.e., virtual addresses that don't have a PMD, PUD, or PGD allocated yet). To observe this bug, use mmap to create a VMA large enough such that there's a good chance that the VMA will occupy an unused PMD, then test the soft-dirty bit on its pages. In practice, I found that a VMA that covered a PMD's worth of address space was big enough. This patch adds the necessary VMA lookup to the PTE hole callback in /proc/pid/pagemap's page walk and sets soft-dirty according to the VMAs' VM_SOFTDIRTY flag. Signed-off-by: Peter Feiner <pfeiner@google.com> Acked-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Hugh Dickins <hughd@google.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: mark fault_around_bytes __read_mostlyKirill A. Shutemov2014-08-071-1/+2
| | | | | | | | | | | | | | fault_around_bytes can only be changed via debugfs. Let's mark it read-mostly. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Suggested-by: David Rientjes <rientjes@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Andrey Ryabinin <a.ryabinin@samsung.com> Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: close race between do_fault_around() and fault_around_bytes_set()Kirill A. Shutemov2014-08-071-14/+7
| | | | | | | | | | | | | | | | | | | | Things can go wrong if fault_around_bytes will be changed under do_fault_around(): between fault_around_mask() and fault_around_pages(). Let's read fault_around_bytes only once during do_fault_around() and calculate mask based on the reading. Note: fault_around_bytes can only be updated via debug interface. Also I've tried but was not able to trigger a bad behaviour without the patch. So I would not consider this patch as urgent. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Andrey Ryabinin <a.ryabinin@samsung.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg, vmscan: Fix forced scan of anonymous pagesJerome Marchand2014-08-071-7/+13
| | | | | | | | | | | | | | | | | | | | | | | | | When memory cgoups are enabled, the code that decides to force to scan anonymous pages in get_scan_count() compares global values (free, high_watermark) to a value that is restricted to a memory cgroup (file). It make the code over-eager to force anon scan. For instance, it will force anon scan when scanning a memcg that is mainly populated by anonymous page, even when there is plenty of file pages to get rid of in others memcgs, even when swappiness == 0. It breaks user's expectation about swappiness and hurts performance. This patch makes sure that forced anon scan only happens when there not enough file pages for the all zone, not just in one random memcg. [hannes@cmpxchg.org: cleanups] Signed-off-by: Jerome Marchand <jmarchan@redhat.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, vmscan: fix an outdated comment still mentioning get_scan_ratioJerome Marchand2014-08-071-1/+1
| | | | | | | | | | | | | | Quite a while ago, get_scan_ratio() has been renamed get_scan_count(), however a comment in shrink_active_list() still mention it. This patch fixes the outdated comment. Signed-off-by: Jerome Marchand <jmarchan@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, oom: remove unnecessary exit_state checkDavid Rientjes2014-08-072-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | The oom killer scans each process and determines whether it is eligible for oom kill or whether the oom killer should abort because of concurrent memory freeing. It will abort when an eligible process is found to have TIF_MEMDIE set, meaning it has already been oom killed and we're waiting for it to exit. Processes with task->mm == NULL should not be considered because they are either kthreads or have already detached their memory and killing them would not lead to memory freeing. That memory is only freed after exit_mm() has returned, however, and not when task->mm is first set to NULL. Clear TIF_MEMDIE after exit_mm()'s mmput() so that an oom killed process is no longer considered for oom kill, but only until exit_mm() has returned. This was fragile in the past because it relied on exit_notify() to be reached before no longer considering TIF_MEMDIE processes. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: fix potential infinite loop in dissolve_free_huge_pages()Li Zhong2014-08-071-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | It is possible for some platforms, such as powerpc to set HPAGE_SHIFT to 0 to indicate huge pages not supported. When this is the case, hugetlbfs could be disabled during boot time: hugetlbfs: disabling because there are no supported hugepage sizes Then in dissolve_free_huge_pages(), order is kept maximum (64 for 64bits), and the for loop below won't end: for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << order) As suggested by Naoya, below fix checks hugepages_supported() before calling dissolve_free_huge_pages(). [rientjes@google.com: no legitimate reason to call dissolve_free_huge_pages() when !hugepages_supported()] Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> [3.12+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, thp: restructure thp avoidance of light synchronous migrationDavid Rientjes2014-08-071-8/+9
| | | | | | | | | | | | | | | | | | | | | | | __GFP_NO_KSWAPD, once the way to determine if an allocation was for thp or not, has gained more users. Their use is not necessarily wrong, they are trying to do a memory allocation that can easily fail without disturbing kswapd, so the bit has gained additional usecases. This restructures the check to determine whether MIGRATE_SYNC_LIGHT should be used for memory compaction in the page allocator. Rather than testing solely for __GFP_NO_KSWAPD, test for all bits that must be set for thp allocations. This also moves the check to be done only after the page allocator is aborted for deferred or contended memory compaction since setting migration_mode for this case is pointless. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, oom: rename zonelist locking functionsDavid Rientjes2014-08-073-22/+18
| | | | | | | | | | | | | | | | | | | | | | | try_set_zonelist_oom() and clear_zonelist_oom() are not named properly to imply that they require locking semantics to avoid out_of_memory() being reordered. zone_scan_lock is required for both functions to ensure that there is proper locking synchronization. Rename try_set_zonelist_oom() to oom_zonelist_trylock() and rename clear_zonelist_oom() to oom_zonelist_unlock() to imply there is proper locking semantics. At the same time, convert oom_zonelist_trylock() to return bool instead of int since only success and failure are tested. Signed-off-by: David Rientjes <rientjes@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, oom: ensure memoryless node zonelist always includes zonesDavid Rientjes2014-08-073-3/+12
| | | | | | | | | | | | | | | | | | | | With memoryless node support being worked on, it's possible that for optimizations that a node may not have a non-NULL zonelist. When CONFIG_NUMA is enabled and node 0 is memoryless, this means the zonelist for first_online_node may become NULL. The oom killer requires a zonelist that includes all memory zones for the sysrq trigger and pagefault out of memory handler. Ensure that a non-NULL zonelist is always passed to the oom killer. [akpm@linux-foundation.org: fix non-numa build] Signed-off-by: David Rientjes <rientjes@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memory-hotplug: sh: suitable memory should go to ZONE_MOVABLEWang Nan2014-08-071-2/+3
| | | | | | | | | | | | | | | | | | | | | This patch introduces zone_for_memory() to arch_add_memory() on sh to ensure new, higher memory added into ZONE_MOVABLE if movable zone has already setup. Signed-off-by: Wang Nan <wangnan0@huawei.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: "Mel Gorman" <mgorman@suse.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Chris Metcalf <cmetcalf@tilera.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memory-hotplug: ppc: suitable memory should go to ZONE_MOVABLEWang Nan2014-08-071-1/+2
| | | | | | | | | | | | | | | | | | | | | This patch introduces zone_for_memory() to arch_add_memory() on powerpc to ensure new, higher memory added into ZONE_MOVABLE if movable zone has already setup. Signed-off-by: Wang Nan <wangnan0@huawei.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: "Mel Gorman" <mgorman@suse.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Chris Metcalf <cmetcalf@tilera.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memory-hotplug: ia64: suitable memory should go to ZONE_MOVABLEWang Nan2014-08-071-1/+2
| | | | | | | | | | | | | | | | | | | | | This patch introduces zone_for_memory() to arch_add_memory() on ia64 to ensure new, higher memory added into ZONE_MOVABLE if movable zone has already setup. Signed-off-by: Wang Nan <wangnan0@huawei.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: "Mel Gorman" <mgorman@suse.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Chris Metcalf <cmetcalf@tilera.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memory-hotplug: x86_32: suitable memory should go to ZONE_MOVABLEWang Nan2014-08-071-1/+2
| | | | | | | | | | | | | | | | | | | | | This patch introduces zone_for_memory() to arch_add_memory() on x86_32 to ensure new, higher memory added into ZONE_MOVABLE if movable zone has already setup. Signed-off-by: Wang Nan <wangnan0@huawei.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: "Mel Gorman" <mgorman@suse.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Chris Metcalf <cmetcalf@tilera.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memory-hotplug: x86_64: suitable memory should go to ZONE_MOVABLEWang Nan2014-08-071-1/+2
| | | | | | | | | | | | | | | | | | | | | This patch introduces zone_for_memory() to arch_add_memory() on x86_64 to ensure new, higher memory added into ZONE_MOVABLE if movable zone has already setup. Signed-off-by: Wang Nan <wangnan0@huawei.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: "Mel Gorman" <mgorman@suse.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Chris Metcalf <cmetcalf@tilera.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memory-hotplug: add zone_for_memory() for selecting zone for new memoryWang Nan2014-08-072-0/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This series of patches fixes a problem when adding memory in bad manner. For example: for a x86_64 machine booted with "mem=400M" and with 2GiB memory installed, following commands cause problem: # echo 0x40000000 > /sys/devices/system/memory/probe [ 28.613895] init_memory_mapping: [mem 0x40000000-0x47ffffff] # echo 0x48000000 > /sys/devices/system/memory/probe [ 28.693675] init_memory_mapping: [mem 0x48000000-0x4fffffff] # echo online_movable > /sys/devices/system/memory/memory9/state # echo 0x50000000 > /sys/devices/system/memory/probe [ 29.084090] init_memory_mapping: [mem 0x50000000-0x57ffffff] # echo 0x58000000 > /sys/devices/system/memory/probe [ 29.151880] init_memory_mapping: [mem 0x58000000-0x5fffffff] # echo online_movable > /sys/devices/system/memory/memory11/state # echo online> /sys/devices/system/memory/memory8/state # echo online> /sys/devices/system/memory/memory10/state # echo offline> /sys/devices/system/memory/memory9/state [ 30.558819] Offlined Pages 32768 # free total used free shared buffers cached Mem: 780588 18014398509432020 830552 0 0 51180 -/+ buffers/cache: 18014398509380840 881732 Swap: 0 0 0 This is because the above commands probe higher memory after online a section with online_movable, which causes ZONE_HIGHMEM (or ZONE_NORMAL for systems without ZONE_HIGHMEM) overlaps ZONE_MOVABLE. After the second online_movable, the problem can be observed from zoneinfo: # cat /proc/zoneinfo ... Node 0, zone Movable pages free 65491 min 250 low 312 high 375 scanned 0 spanned 18446744073709518848 present 65536 managed 65536 ... This series of patches solve the problem by checking ZONE_MOVABLE when choosing zone for new memory. If new memory is inside or higher than ZONE_MOVABLE, makes it go there instead. After applying this series of patches, following are free and zoneinfo result (after offlining memory9): bash-4.2# free total used free shared buffers cached Mem: 780956 80112 700844 0 0 51180 -/+ buffers/cache: 28932 752024 Swap: 0 0 0 bash-4.2# cat /proc/zoneinfo Node 0, zone DMA pages free 3389 min 14 low 17 high 21 scanned 0 spanned 4095 present 3998 managed 3977 nr_free_pages 3389 ... start_pfn: 1 inactive_ratio: 1 Node 0, zone DMA32 pages free 73724 min 341 low 426 high 511 scanned 0 spanned 98304 present 98304 managed 92958 nr_free_pages 73724 ... start_pfn: 4096 inactive_ratio: 1 Node 0, zone Normal pages free 32630 min 120 low 150 high 180 scanned 0 spanned 32768 present 32768 managed 32768 nr_free_pages 32630 ... start_pfn: 262144 inactive_ratio: 1 Node 0, zone Movable pages free 65476 min 241 low 301 high 361 scanned 0 spanned 98304 present 65536 managed 65536 nr_free_pages 65476 ... start_pfn: 294912 inactive_ratio: 1 This patch (of 7): Introduce zone_for_memory() in arch independent code for arch_add_memory() use. Many arch_add_memory() function simply selects ZONE_HIGHMEM or ZONE_NORMAL and add new memory into it. However, with the existance of ZONE_MOVABLE, the selection method should be carefully considered: if new, higher memory is added after ZONE_MOVABLE is setup, the default zone and ZONE_MOVABLE may overlap each other. should_add_memory_movable() checks the status of ZONE_MOVABLE. If it has already contain memory, compare the address of new memory and movable memory. If new memory is higher than movable, it should be added into ZONE_MOVABLE instead of default zone. Signed-off-by: Wang Nan <wangnan0@huawei.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: "Mel Gorman" <mgorman@suse.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Chris Metcalf <cmetcalf@tilera.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* slub: remove kmemcg id from create_unique_idVladimir Davydov2014-08-071-6/+0
| | | | | | | | | | | | | | | This function is never called for memcg caches, because they are unmergeable, so remove the dead code. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Reviewed-by: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, writeback: prevent race when calculating dirty limitsDavid Rientjes2014-08-071-4/+1
| | | | | | | | | | | | | | | | | | Setting vm_dirty_bytes and dirty_background_bytes is not protected by any serialization. Therefore, it's possible for either variable to change value after the test in global_dirty_limits() to determine whether available_memory needs to be initialized or not. Always ensure that available_memory is properly initialized. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, thp: only collapse hugepages to nodes with affinity for zone_reclaim_modeDavid Rientjes2014-08-071-0/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target node") improved the previous khugepaged logic which allocated a transparent hugepages from the node of the first page being collapsed. However, it is still possible to collapse pages to remote memory which may suffer from additional access latency. With the current policy, it is possible that 255 pages (with PAGE_SHIFT == 12) will be collapsed remotely if the majority are allocated from that node. When zone_reclaim_mode is enabled, it means the VM should make every attempt to allocate locally to prevent NUMA performance degradation. In this case, we do not want to collapse hugepages to remote nodes that would suffer from increased access latency. Thus, when zone_reclaim_mode is enabled, only allow collapsing to nodes with RECLAIM_DISTANCE or less. There is no functional change for systems that disable zone_reclaim_mode. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Bob Liu <bob.liu@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/shmem.c: remove the unused gfp arg to shmem_add_to_page_cache()Wang Sheng-Hui2014-08-071-4/+4
| | | | | | | | | | The gfp arg is not used in shmem_add_to_page_cache. Remove this unused arg. Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: describe mmap_sem rules for __lock_page_or_retry() and callersPaul Cassella2014-08-076-8/+82
| | | | | | | | | | | | | | | | Add a comment describing the circumstances in which __lock_page_or_retry() will or will not release the mmap_sem when returning 0. Add comments to lock_page_or_retry()'s callers (filemap_fault(), do_swap_page()) noting the impact on VM_FAULT_RETRY returns. Add comments on up the call tree, particularly replacing the false "We return with mmap_sem still held" comments. Signed-off-by: Paul Cassella <cassella@cray.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: page_alloc: reduce cost of the fair zone allocation policyMel Gorman2014-08-072-48/+59
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The fair zone allocation policy round-robins allocations between zones within a node to avoid age inversion problems during reclaim. If the first allocation fails, the batch counts are reset and a second attempt made before entering the slow path. One assumption made with this scheme is that batches expire at roughly the same time and the resets each time are justified. This assumption does not hold when zones reach their low watermark as the batches will be consumed at uneven rates. Allocation failure due to watermark depletion result in additional zonelist scans for the reset and another watermark check before hitting the slowpath. On UMA, the benefit is negligible -- around 0.25%. On 4-socket NUMA machine it's variable due to the variability of measuring overhead with the vmstat changes. The system CPU overhead comparison looks like 3.16.0-rc3 3.16.0-rc3 3.16.0-rc3 vanilla vmstat-v5 lowercost-v5 User 746.94 774.56 802.00 System 65336.22 32847.27 40852.33 Elapsed 27553.52 27415.04 27368.46 However it is worth noting that the overall benchmark still completed faster and intuitively it makes sense to take as few passes as possible through the zonelists. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: page_alloc: abort fair zone allocation policy when remotes nodes are ↵Mel Gorman2014-08-071-1/+1
| | | | | | | | | | | | | | | | | | | encountered The purpose of numa_zonelist_order=zone is to preserve lower zones for use with 32-bit devices. If locality is preferred then the numa_zonelist_order=node policy should be used. Unfortunately, the fair zone allocation policy overrides this by skipping zones on remote nodes until the lower one is found. While this makes sense from a page aging and performance perspective, it breaks the expected zonelist policy. This patch restores the expected behaviour for zone-list ordering. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: vmscan: only update per-cpu thresholds for online CPUMel Gorman2014-08-071-1/+1
| | | | | | | | | | | | | When kswapd is awake reclaiming, the per-cpu stat thresholds are lowered to get more accurate counts to avoid breaching watermarks. This threshold update iterates over all possible CPUs which is unnecessary. Only online CPUs need to be updated. If a new CPU is onlined, refresh_zone_stat_thresholds() will set the thresholds correctly. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: move zone->pages_scanned into a vmstat counterMel Gorman2014-08-074-8/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | zone->pages_scanned is a write-intensive cache line during page reclaim and it's also updated during page free. Move the counter into vmstat to take advantage of the per-cpu updates and do not update it in the free paths unless necessary. On a small UMA machine running tiobench the difference is marginal. On a 4-node machine the overhead is more noticable. Note that automatic NUMA balancing was disabled for this test as otherwise the system CPU overhead is unpredictable. 3.16.0-rc3 3.16.0-rc3 3.16.0-rc3 vanillarearrange-v5 vmstat-v5 User 746.94 759.78 774.56 System 65336.22 58350.98 32847.27 Elapsed 27553.52 27282.02 27415.04 Note that the overhead reduction will vary depending on where exactly pages are allocated and freed. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: rearrange zone fields into read-only, page alloc, statistics and page ↵Mel Gorman2014-08-073-109/+113
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | reclaim lines The arrangement of struct zone has changed over time and now it has reached the point where there is some inappropriate sharing going on. On x86-64 for example o The zone->node field is shared with the zone lock and zone->node is accessed frequently from the page allocator due to the fair zone allocation policy. o span_seqlock is almost never used by shares a line with free_area o Some zone statistics share a cache line with the LRU lock so reclaim-intensive and allocator-intensive workloads can bounce the cache line on a stat update This patch rearranges struct zone to put read-only and read-mostly fields together and then splits the page allocator intensive fields, the zone statistics and the page reclaim intensive fields into their own cache lines. Note that the type of lowmem_reserve changes due to the watermark calculations being signed and avoiding a signed/unsigned conversion there. On the test configuration I used the overall size of struct zone shrunk by one cache line. On smaller machines, this is not likely to be noticable. However, on a 4-node NUMA machine running tiobench the system CPU overhead is reduced by this patch. 3.16.0-rc3 3.16.0-rc3 vanillarearrange-v5r9 User 746.94 759.78 System 65336.22 58350.98 Elapsed 27553.52 27282.02 Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: pagemap: avoid unnecessary overhead when tracepoints are deactivatedMel Gorman2014-08-072-11/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This was formerly the series "Improve sequential read throughput" which noted some major differences in performance of tiobench since 3.0. While there are a number of factors, two that dominated were the introduction of the fair zone allocation policy and changes to CFQ. The behaviour of fair zone allocation policy makes more sense than tiobench as a benchmark and CFQ defaults were not changed due to insufficient benchmarking. This series is what's left. It's one functional fix to the fair zone allocation policy when used on NUMA machines and a reduction of overhead in general. tiobench was used for the comparison despite its flaws as an IO benchmark as in this case we are primarily interested in the overhead of page allocator and page reclaim activity. On UMA, it makes little difference to overhead 3.16.0-rc3 3.16.0-rc3 vanilla lowercost-v5 User 383.61 386.77 System 403.83 401.74 Elapsed 5411.50 5413.11 On a 4-socket NUMA machine it's a bit more noticable 3.16.0-rc3 3.16.0-rc3 vanilla lowercost-v5 User 746.94 802.00 System 65336.22 40852.33 Elapsed 27553.52 27368.46 This patch (of 6): The LRU insertion and activate tracepoints take PFN as a parameter forcing the overhead to the caller. Move the overhead to the tracepoint fast-assign method to ensure the cost is only incurred when the tracepoint is active. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: trace-vmscan-postprocess.pl: report the number of file/anon pages ↵Chen Yucong2014-08-071-0/+53
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | respectively Until now, the reporting from trace-vmscan-postprocess.pl is not very useful because we cannot directly use this script for checking the file/anon ratio of scanning. This patch aims to report respectively the number of file/anon pages which were scanned/reclaimed by kswapd or direct-reclaim. Sample output is usually something like the following. Summary Direct reclaims: 8823 Direct reclaim pages scanned: 2438797 Direct reclaim file pages scanned: 1315200 Direct reclaim anon pages scanned: 1123597 Direct reclaim pages reclaimed: 446139 Direct reclaim file pages reclaimed: 378668 Direct reclaim anon pages reclaimed: 67471 Direct reclaim write file sync I/O: 0 Direct reclaim write anon sync I/O: 0 Direct reclaim write file async I/O: 0 Direct reclaim write anon async I/O: 4240 Wake kswapd requests: 122310 Time stalled direct reclaim: 13.78 seconds Kswapd wakeups: 25817 Kswapd pages scanned: 170779115 Kswapd file pages scanned: 162725123 Kswapd anon pages scanned: 8053992 Kswapd pages reclaimed: 129065738 Kswapd file pages reclaimed: 128500930 Kswapd anon pages reclaimed: 564808 Kswapd reclaim write file sync I/O: 0 Kswapd reclaim write anon sync I/O: 0 Kswapd reclaim write file async I/O: 36 Kswapd reclaim write anon async I/O: 730730 Time kswapd awake: 1015.50 seconds Signed-off-by: Chen Yucong <slaoub@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>