diff options
author | Linus Torvalds <torvalds@linux-foundation.org> | 2017-09-07 05:49:49 +0200 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2017-09-07 05:49:49 +0200 |
commit | d34fc1adf01ff87026da85fb972dc259dc347540 (patch) | |
tree | 27356073d423187157b7cdb69da32b53102fb9e7 /Documentation | |
parent | x86/mm: Document how CR4.PCIDE restore works (diff) | |
parent | mm,fork: introduce MADV_WIPEONFORK (diff) | |
download | linux-d34fc1adf01ff87026da85fb972dc259dc347540.tar.xz linux-d34fc1adf01ff87026da85fb972dc259dc347540.zip |
Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:
- various misc bits
- DAX updates
- OCFS2
- most of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (119 commits)
mm,fork: introduce MADV_WIPEONFORK
x86,mpx: make mpx depend on x86-64 to free up VMA flag
mm: add /proc/pid/smaps_rollup
mm: hugetlb: clear target sub-page last when clearing huge page
mm: oom: let oom_reap_task and exit_mmap run concurrently
swap: choose swap device according to numa node
mm: replace TIF_MEMDIE checks by tsk_is_oom_victim
mm, oom: do not rely on TIF_MEMDIE for memory reserves access
z3fold: use per-cpu unbuddied lists
mm, swap: don't use VMA based swap readahead if HDD is used as swap
mm, swap: add sysfs interface for VMA based swap readahead
mm, swap: VMA based swap readahead
mm, swap: fix swap readahead marking
mm, swap: add swap readahead hit statistics
mm/vmalloc.c: don't reinvent the wheel but use existing llist API
mm/vmstat.c: fix wrong comment
selftests/memfd: add memfd_create hugetlbfs selftest
mm/shmem: add hugetlbfs support to memfd_create()
mm, devm_memremap_pages: use multi-order radix for ZONE_DEVICE lookups
mm/vmalloc.c: halve the number of comparisons performed in pcpu_get_vm_areas()
...
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/ABI/testing/procfs-smaps_rollup | 31 | ||||
-rw-r--r-- | Documentation/ABI/testing/sysfs-block-zram | 8 | ||||
-rw-r--r-- | Documentation/ABI/testing/sysfs-kernel-mm-swap | 26 | ||||
-rw-r--r-- | Documentation/admin-guide/kernel-parameters.txt | 2 | ||||
-rw-r--r-- | Documentation/blockdev/zram.txt | 11 | ||||
-rw-r--r-- | Documentation/filesystems/caching/netfs-api.txt | 2 | ||||
-rw-r--r-- | Documentation/filesystems/dax.txt | 5 | ||||
-rw-r--r-- | Documentation/sysctl/vm.txt | 4 | ||||
-rw-r--r-- | Documentation/vm/numa | 7 | ||||
-rw-r--r-- | Documentation/vm/swap_numa.txt | 69 |
10 files changed, 153 insertions, 12 deletions
diff --git a/Documentation/ABI/testing/procfs-smaps_rollup b/Documentation/ABI/testing/procfs-smaps_rollup new file mode 100644 index 000000000000..0a54ed0d63c9 --- /dev/null +++ b/Documentation/ABI/testing/procfs-smaps_rollup @@ -0,0 +1,31 @@ +What: /proc/pid/smaps_rollup +Date: August 2017 +Contact: Daniel Colascione <dancol@google.com> +Description: + This file provides pre-summed memory information for a + process. The format is identical to /proc/pid/smaps, + except instead of an entry for each VMA in a process, + smaps_rollup has a single entry (tagged "[rollup]") + for which each field is the sum of the corresponding + fields from all the maps in /proc/pid/smaps. + For more details, see the procfs man page. + + Typical output looks like this: + + 00100000-ff709000 ---p 00000000 00:00 0 [rollup] + Rss: 884 kB + Pss: 385 kB + Shared_Clean: 696 kB + Shared_Dirty: 0 kB + Private_Clean: 120 kB + Private_Dirty: 68 kB + Referenced: 884 kB + Anonymous: 68 kB + LazyFree: 0 kB + AnonHugePages: 0 kB + ShmemPmdMapped: 0 kB + Shared_Hugetlb: 0 kB + Private_Hugetlb: 0 kB + Swap: 0 kB + SwapPss: 0 kB + Locked: 385 kB diff --git a/Documentation/ABI/testing/sysfs-block-zram b/Documentation/ABI/testing/sysfs-block-zram index 451b6d882b2c..c1513c756af1 100644 --- a/Documentation/ABI/testing/sysfs-block-zram +++ b/Documentation/ABI/testing/sysfs-block-zram @@ -90,3 +90,11 @@ Description: device's debugging info useful for kernel developers. Its format is not documented intentionally and may change anytime without any notice. + +What: /sys/block/zram<id>/backing_dev +Date: June 2017 +Contact: Minchan Kim <minchan@kernel.org> +Description: + The backing_dev file is read-write and set up backing + device for zram to write incompressible pages. + For using, user should enable CONFIG_ZRAM_WRITEBACK. diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-swap b/Documentation/ABI/testing/sysfs-kernel-mm-swap new file mode 100644 index 000000000000..587db52084c7 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-kernel-mm-swap @@ -0,0 +1,26 @@ +What: /sys/kernel/mm/swap/ +Date: August 2017 +Contact: Linux memory management mailing list <linux-mm@kvack.org> +Description: Interface for swapping + +What: /sys/kernel/mm/swap/vma_ra_enabled +Date: August 2017 +Contact: Linux memory management mailing list <linux-mm@kvack.org> +Description: Enable/disable VMA based swap readahead. + + If set to true, the VMA based swap readahead algorithm + will be used for swappable anonymous pages mapped in a + VMA, and the global swap readahead algorithm will be + still used for tmpfs etc. other users. If set to + false, the global swap readahead algorithm will be + used for all swappable pages. + +What: /sys/kernel/mm/swap/vma_ra_max_order +Date: August 2017 +Contact: Linux memory management mailing list <linux-mm@kvack.org> +Description: The max readahead size in order for VMA based swap readahead + + VMA based swap readahead algorithm will readahead at + most 1 << max_order pages for each readahead. The + real readahead size for each readahead will be scaled + according to the estimation algorithm. diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 6996b7727b85..86b0e8ec8ad7 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -2783,7 +2783,7 @@ Allowed values are enable and disable numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA. - one of ['zone', 'node', 'default'] can be specified + 'node', 'default' can be specified This can be set from sysctl after boot. See Documentation/sysctl/vm.txt for details. diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt index 4fced8a21307..257e65714c6a 100644 --- a/Documentation/blockdev/zram.txt +++ b/Documentation/blockdev/zram.txt @@ -168,6 +168,7 @@ max_comp_streams RW the number of possible concurrent compress operations comp_algorithm RW show and change the compression algorithm compact WO trigger memory compaction debug_stat RO this file is used for zram debugging purposes +backing_dev RW set up backend storage for zram to write out User space is advised to use the following files to read the device statistics. @@ -231,5 +232,15 @@ line of text and contains the following stats separated by whitespace: resets the disksize to zero. You must set the disksize again before reusing the device. +* Optional Feature + += writeback + +With incompressible pages, there is no memory saving with zram. +Instead, with CONFIG_ZRAM_WRITEBACK, zram can write incompressible page +to backing storage rather than keeping it in memory. +User should set up backing device via /sys/block/zramX/backing_dev +before disksize setting. + Nitin Gupta ngupta@vflare.org diff --git a/Documentation/filesystems/caching/netfs-api.txt b/Documentation/filesystems/caching/netfs-api.txt index aed6b94160b1..0eb31de3a2c1 100644 --- a/Documentation/filesystems/caching/netfs-api.txt +++ b/Documentation/filesystems/caching/netfs-api.txt @@ -151,8 +151,6 @@ To define an object, a structure of the following type should be filled out: void (*mark_pages_cached)(void *cookie_netfs_data, struct address_space *mapping, struct pagevec *cached_pvec); - - void (*now_uncached)(void *cookie_netfs_data); }; This has the following fields: diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt index a7e6e14aeb08..3be3b266be41 100644 --- a/Documentation/filesystems/dax.txt +++ b/Documentation/filesystems/dax.txt @@ -63,9 +63,8 @@ Filesystem support consists of - implementing an mmap file operation for DAX files which sets the VM_MIXEDMAP and VM_HUGEPAGE flags on the VMA, and setting the vm_ops to include handlers for fault, pmd_fault, page_mkwrite, pfn_mkwrite. These - handlers should probably call dax_iomap_fault() (for fault and page_mkwrite - handlers), dax_iomap_pmd_fault(), dax_pfn_mkwrite() passing the appropriate - iomap operations. + handlers should probably call dax_iomap_fault() passing the appropriate + fault size and iomap operations. - calling iomap_zero_range() passing appropriate iomap operations instead of block_truncate_page() for DAX files - ensuring that there is sufficient locking between reads, writes, diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 48244c42ff52..9baf66a9ef4e 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -572,7 +572,9 @@ See Documentation/nommu-mmap.txt for more information. numa_zonelist_order -This sysctl is only for NUMA. +This sysctl is only for NUMA and it is deprecated. Anything but +Node order will fail! + 'where the memory is allocated from' is controlled by zonelists. (This documentation ignores ZONE_HIGHMEM/ZONE_DMA32 for simple explanation. you may be able to read ZONE_DMA as ZONE_DMA32...) diff --git a/Documentation/vm/numa b/Documentation/vm/numa index a08f71647714..a31b85b9bb88 100644 --- a/Documentation/vm/numa +++ b/Documentation/vm/numa @@ -79,11 +79,8 @@ memory, Linux must decide whether to order the zonelists such that allocations fall back to the same zone type on a different node, or to a different zone type on the same node. This is an important consideration because some zones, such as DMA or DMA32, represent relatively scarce resources. Linux chooses -a default zonelist order based on the sizes of the various zone types relative -to the total memory of the node and the total memory of the system. The -default zonelist order may be overridden using the numa_zonelist_order kernel -boot parameter or sysctl. [see Documentation/admin-guide/kernel-parameters.rst and -Documentation/sysctl/vm.txt] +a default Node ordered zonelist. This means it tries to fallback to other zones +from the same node before using remote nodes which are ordered by NUMA distance. By default, Linux will attempt to satisfy memory allocation requests from the node to which the CPU that executes the request is assigned. Specifically, diff --git a/Documentation/vm/swap_numa.txt b/Documentation/vm/swap_numa.txt new file mode 100644 index 000000000000..d5960c9124f5 --- /dev/null +++ b/Documentation/vm/swap_numa.txt @@ -0,0 +1,69 @@ +Automatically bind swap device to numa node +------------------------------------------- + +If the system has more than one swap device and swap device has the node +information, we can make use of this information to decide which swap +device to use in get_swap_pages() to get better performance. + + +How to use this feature +----------------------- + +Swap device has priority and that decides the order of it to be used. To make +use of automatically binding, there is no need to manipulate priority settings +for swap devices. e.g. on a 2 node machine, assume 2 swap devices swapA and +swapB, with swapA attached to node 0 and swapB attached to node 1, are going +to be swapped on. Simply swapping them on by doing: +# swapon /dev/swapA +# swapon /dev/swapB + +Then node 0 will use the two swap devices in the order of swapA then swapB and +node 1 will use the two swap devices in the order of swapB then swapA. Note +that the order of them being swapped on doesn't matter. + +A more complex example on a 4 node machine. Assume 6 swap devices are going to +be swapped on: swapA and swapB are attached to node 0, swapC is attached to +node 1, swapD and swapE are attached to node 2 and swapF is attached to node3. +The way to swap them on is the same as above: +# swapon /dev/swapA +# swapon /dev/swapB +# swapon /dev/swapC +# swapon /dev/swapD +# swapon /dev/swapE +# swapon /dev/swapF + +Then node 0 will use them in the order of: +swapA/swapB -> swapC -> swapD -> swapE -> swapF +swapA and swapB will be used in a round robin mode before any other swap device. + +node 1 will use them in the order of: +swapC -> swapA -> swapB -> swapD -> swapE -> swapF + +node 2 will use them in the order of: +swapD/swapE -> swapA -> swapB -> swapC -> swapF +Similaly, swapD and swapE will be used in a round robin mode before any +other swap devices. + +node 3 will use them in the order of: +swapF -> swapA -> swapB -> swapC -> swapD -> swapE + + +Implementation details +---------------------- + +The current code uses a priority based list, swap_avail_list, to decide +which swap device to use and if multiple swap devices share the same +priority, they are used round robin. This change here replaces the single +global swap_avail_list with a per-numa-node list, i.e. for each numa node, +it sees its own priority based list of available swap devices. Swap +device's priority can be promoted on its matching node's swap_avail_list. + +The current swap device's priority is set as: user can set a >=0 value, +or the system will pick one starting from -1 then downwards. The priority +value in the swap_avail_list is the negated value of the swap device's +due to plist being sorted from low to high. The new policy doesn't change +the semantics for priority >=0 cases, the previous starting from -1 then +downwards now becomes starting from -2 then downwards and -1 is reserved +as the promoted value. So if multiple swap devices are attached to the same +node, they will all be promoted to priority -1 on that node's plist and will +be used round robin before any other swap devices. |