summaryrefslogtreecommitdiffstats
path: root/mm/zsmalloc.c (follow)
Commit message (Collapse)AuthorAgeFilesLines
* mm: zswap: remove shrink from zpool interfaceDomenico Cerasuolo2023-06-201-3/+1
| | | | | | | | | | | | | | | | | | Now that all three zswap backends have removed their shrink code, it is no longer necessary for the zpool interface to include shrink/writeback endpoints. Link: https://lkml.kernel.org/r/20230612093815.133504-6-cerasuolodomenico@gmail.com Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm: zswap: remove page reclaim logic from zsmallocDomenico Cerasuolo2023-06-201-380/+12
| | | | | | | | | | | | | | | | | Switch zsmalloc to the new generic zswap LRU and remove its custom implementation. Link: https://lkml.kernel.org/r/20230612093815.133504-5-cerasuolodomenico@gmail.com Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Tested-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm/zsmalloc: get rid of PAGE_MASKAlexey Romanov2023-06-101-6/+6
| | | | | | | | | | Use offset_in_page() macro instead of 'val & ~PAGE_MASK' Link: https://lkml.kernel.org/r/20230516095029.49036-2-avromanov@sberdevices.ru Signed-off-by: Alexey Romanov <avromanov@sberdevices.ru> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* zsmalloc: move LRU update from zs_map_object() to zs_malloc()Nhat Pham2023-05-181-27/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Under memory pressure, we sometimes observe the following crash: [ 5694.832838] ------------[ cut here ]------------ [ 5694.842093] list_del corruption, ffff888014b6a448->next is LIST_POISON1 (dead000000000100) [ 5694.858677] WARNING: CPU: 33 PID: 418824 at lib/list_debug.c:47 __list_del_entry_valid+0x42/0x80 [ 5694.961820] CPU: 33 PID: 418824 Comm: fuse_counters.s Kdump: loaded Tainted: G S 5.19.0-0_fbk3_rc3_hoangnhatpzsdynshrv41_10870_g85a9558a25de #1 [ 5694.990194] Hardware name: Wiwynn Twin Lakes MP/Twin Lakes Passive MP, BIOS YMM16 05/24/2021 [ 5695.007072] RIP: 0010:__list_del_entry_valid+0x42/0x80 [ 5695.017351] Code: 08 48 83 c2 22 48 39 d0 74 24 48 8b 10 48 39 f2 75 2c 48 8b 51 08 b0 01 48 39 f2 75 34 c3 48 c7 c7 55 d7 78 82 e8 4e 45 3b 00 <0f> 0b eb 31 48 c7 c7 27 a8 70 82 e8 3e 45 3b 00 0f 0b eb 21 48 c7 [ 5695.054919] RSP: 0018:ffffc90027aef4f0 EFLAGS: 00010246 [ 5695.065366] RAX: 41fe484987275300 RBX: ffff888008988180 RCX: 0000000000000000 [ 5695.079636] RDX: ffff88886006c280 RSI: ffff888860060480 RDI: ffff888860060480 [ 5695.093904] RBP: 0000000000000002 R08: 0000000000000000 R09: ffffc90027aef370 [ 5695.108175] R10: 0000000000000000 R11: ffffffff82fdf1c0 R12: 0000000010000002 [ 5695.122447] R13: ffff888014b6a448 R14: ffff888014b6a420 R15: 00000000138dc240 [ 5695.136717] FS: 00007f23a7d3f740(0000) GS:ffff888860040000(0000) knlGS:0000000000000000 [ 5695.152899] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 5695.164388] CR2: 0000560ceaab6ac0 CR3: 000000001c06c001 CR4: 00000000007706e0 [ 5695.178659] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 5695.192927] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 5695.207197] PKRU: 55555554 [ 5695.212602] Call Trace: [ 5695.217486] <TASK> [ 5695.221674] zs_map_object+0x91/0x270 [ 5695.229000] zswap_frontswap_store+0x33d/0x870 [ 5695.237885] ? do_raw_spin_lock+0x5d/0xa0 [ 5695.245899] __frontswap_store+0x51/0xb0 [ 5695.253742] swap_writepage+0x3c/0x60 [ 5695.261063] shrink_page_list+0x738/0x1230 [ 5695.269255] shrink_lruvec+0x5ec/0xcd0 [ 5695.276749] ? shrink_slab+0x187/0x5f0 [ 5695.284240] ? mem_cgroup_iter+0x6e/0x120 [ 5695.292255] shrink_node+0x293/0x7b0 [ 5695.299402] do_try_to_free_pages+0xea/0x550 [ 5695.307940] try_to_free_pages+0x19a/0x490 [ 5695.316126] __folio_alloc+0x19ff/0x3e40 [ 5695.323971] ? __filemap_get_folio+0x8a/0x4e0 [ 5695.332681] ? walk_component+0x2a8/0xb50 [ 5695.340697] ? generic_permission+0xda/0x2a0 [ 5695.349231] ? __filemap_get_folio+0x8a/0x4e0 [ 5695.357940] ? walk_component+0x2a8/0xb50 [ 5695.365955] vma_alloc_folio+0x10e/0x570 [ 5695.373796] ? walk_component+0x52/0xb50 [ 5695.381634] wp_page_copy+0x38c/0xc10 [ 5695.388953] ? filename_lookup+0x378/0xbc0 [ 5695.397140] handle_mm_fault+0x87f/0x1800 [ 5695.405157] do_user_addr_fault+0x1bd/0x570 [ 5695.413520] exc_page_fault+0x5d/0x110 [ 5695.421017] asm_exc_page_fault+0x22/0x30 After some investigation, I have found the following issue: unlike other zswap backends, zsmalloc performs the LRU list update at the object mapping time, rather than when the slot for the object is allocated. This deviation was discussed and agreed upon during the review process of the zsmalloc writeback patch series: https://lore.kernel.org/lkml/Y3flcAXNxxrvy3ZH@cmpxchg.org/ Unfortunately, this introduces a subtle bug that occurs when there is a concurrent store and reclaim, which interleave as follows: zswap_frontswap_store() shrink_worker() zs_malloc() zs_zpool_shrink() spin_lock(&pool->lock) zs_reclaim_page() zspage = find_get_zspage() spin_unlock(&pool->lock) spin_lock(&pool->lock) zspage = list_first_entry(&pool->lru) list_del(&zspage->lru) zspage->lru.next = LIST_POISON1 zspage->lru.prev = LIST_POISON2 spin_unlock(&pool->lock) zs_map_object() spin_lock(&pool->lock) if (!list_empty(&zspage->lru)) list_del(&zspage->lru) CHECK_DATA_CORRUPTION(next == LIST_POISON1) /* BOOM */ With the current upstream code, this issue rarely happens. zswap only triggers writeback when the pool is already full, at which point all further store attempts are short-circuited. This creates an implicit pseudo-serialization between reclaim and store. I am working on a new zswap shrinking mechanism, which makes interleaving reclaim and store more likely, exposing this bug. zbud and z3fold do not have this problem, because they perform the LRU list update in the alloc function, while still holding the pool's lock. This patch fixes the aforementioned bug by moving the LRU update back to zs_malloc(), analogous to zbud and z3fold. Link: https://lkml.kernel.org/r/20230505185054.2417128-1-nphamcs@gmail.com Fixes: 64f768c6b32e ("zsmalloc: add a LRU to zs_pool to keep track of zspages in LRU order") Signed-off-by: Nhat Pham <nphamcs@gmail.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* zsmalloc: allow only one active pool compaction contextSergey Senozhatsky2023-04-211-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | zsmalloc pool can be compacted concurrently by many contexts, e.g. cc1 handle_mm_fault() do_anonymous_page() __alloc_pages_slowpath() try_to_free_pages() do_try_to_free_pages( lru_gen_shrink_node() shrink_slab() do_shrink_slab() zs_shrinker_scan() zs_compact() Pool compaction is currently (basically) single-threaded as it is performed under pool->lock. Having multiple compaction threads results in unnecessary contention, as each thread competes for pool->lock. This, in turn, affects all zsmalloc operations such as zs_malloc(), zs_map_object(), zs_free(), etc. Introduce the pool->compaction_in_progress atomic variable, which ensures that only one compaction context can run at a time. This reduces overall pool->lock contention in (corner) cases when many contexts attempt to shrink zspool simultaneously. Link: https://lkml.kernel.org/r/20230418074639.1903197-1-senozhatsky@chromium.org Fixes: c0547d0b6a4b ("zsmalloc: consolidate zs_pool's migrate_lock and size_class's locks") Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* zsmalloc: reset compaction source zspage pointer after putback_zspage()Sergey Senozhatsky2023-04-191-1/+1
| | | | | | | | | | | | | | | | | | | | The current implementation of the compaction loop fails to set the source zspage pointer to NULL in all cases, leading to a potential issue where __zs_compact() could use a stale zspage pointer. This pointer could even point to a previously freed zspage, causing unexpected behavior in the putback_zspage() and migrate_write_unlock() functions after returning from the compaction loop. Address the issue by ensuring that the source zspage pointer is always set to NULL when it should be. Link: https://lkml.kernel.org/r/20230417130850.1784777-1-senozhatsky@chromium.org Fixes: 5a845e9f2d66 ("zsmalloc: rework compaction algorithm") Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reported-by: Yu Zhao <yuzhao@google.com> Tested-by: Yu Zhao <yuzhao@google.com> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* zsmalloc: show per fullness group class statsSergey Senozhatsky2023-03-291-30/+23
| | | | | | | | | | | | | We keep the old fullness (3/4 threshold) reporting in zs_stats_size_show(). Switch from allmost full/empty stats to fine-grained per inuse ratio (fullness group) reporting, which gives signicantly more data on classes fragmentation. Link: https://lkml.kernel.org/r/20230304034835.2082479-5-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* zsmalloc: rework compaction algorithmSergey Senozhatsky2023-03-291-42/+36
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The zsmalloc compaction algorithm has the potential to waste some CPU cycles, particularly when compacting pages within the same fullness group. This is due to the way it selects the head page of the fullness list for source and destination pages, and how it reinserts those pages during each iteration. The algorithm may first use a page as a migration destination and then as a migration source, leading to an unnecessary back-and-forth movement of objects. Consider the following fullness list: PageA PageB PageC PageD PageE During the first iteration, the compaction algorithm will select PageA as the source and PageB as the destination. All of PageA's objects will be moved to PageB, and then PageA will be released while PageB is reinserted into the fullness list. PageB PageC PageD PageE During the next iteration, the compaction algorithm will again select the head of the list as the source and destination, meaning that PageB will now serve as the source and PageC as the destination. This will result in the objects being moved away from PageB, the same objects that were just moved to PageB in the previous iteration. To prevent this avalanche effect, the compaction algorithm should not reinsert the destination page between iterations. By doing so, the most optimal page will continue to be used and its usage ratio will increase, reducing internal fragmentation. The destination page should only be reinserted into the fullness list if: - It becomes full - No source page is available. TEST ==== It's very challenging to reliably test this series. I ended up developing my own synthetic test that has 100% reproducibility. The test generates significan fragmentation (for each size class) and then performs compaction for each class individually and tracks the number of memcpy() in zs_object_copy(), so that we can compare the amount work compaction does on per-class basis. Total amount of work (zram mm_stat objs_moved) ---------------------------------------------- Old fullness grouping, old compaction algorithm: 323977 memcpy() in zs_object_copy(). Old fullness grouping, new compaction algorithm: 262944 memcpy() in zs_object_copy(). New fullness grouping, new compaction algorithm: 213978 memcpy() in zs_object_copy(). Per-class compaction memcpy() comparison (T-test) ------------------------------------------------- x Old fullness grouping, old compaction algorithm + Old fullness grouping, new compaction algorithm N Min Max Median Avg Stddev x 140 349 3513 2461 2314.1214 806.03271 + 140 289 2778 2006 1878.1714 641.02073 Difference at 95.0% confidence -435.95 +/- 170.595 -18.8387% +/- 7.37193% (Student's t, pooled s = 728.216) x Old fullness grouping, old compaction algorithm + New fullness grouping, new compaction algorithm N Min Max Median Avg Stddev x 140 349 3513 2461 2314.1214 806.03271 + 140 226 2279 1644 1528.4143 524.85268 Difference at 95.0% confidence -785.707 +/- 159.331 -33.9527% +/- 6.88516% (Student's t, pooled s = 680.132) Link: https://lkml.kernel.org/r/20230304034835.2082479-4-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* zsmalloc: fine-grained inuse ratio based fullness groupingSergey Senozhatsky2023-03-291-118/+130
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Each zspage maintains ->inuse counter which keeps track of the number of objects stored in the zspage. The ->inuse counter also determines the zspage's "fullness group" which is calculated as the ratio of the "inuse" objects to the total number of objects the zspage can hold (objs_per_zspage). The closer the ->inuse counter is to objs_per_zspage, the better. Each size class maintains several fullness lists, that keep track of zspages of particular "fullness". Pages within each fullness list are stored in random order with regard to the ->inuse counter. This is because sorting the zspages by ->inuse counter each time obj_malloc() or obj_free() is called would be too expensive. However, the ->inuse counter is still a crucial factor in many situations. For the two major zsmalloc operations, zs_malloc() and zs_compact(), we typically select the head zspage from the corresponding fullness list as the best candidate zspage. However, this assumption is not always accurate. For the zs_malloc() operation, the optimal candidate zspage should have the highest ->inuse counter. This is because the goal is to maximize the number of ZS_FULL zspages and make full use of all allocated memory. For the zs_compact() operation, the optimal source zspage should have the lowest ->inuse counter. This is because compaction needs to move objects in use to another page before it can release the zspage and return its physical pages to the buddy allocator. The fewer objects in use, the quicker compaction can release the zspage. Additionally, compaction is measured by the number of pages it releases. This patch reworks the fullness grouping mechanism. Instead of having two groups - ZS_ALMOST_EMPTY (usage ratio below 3/4) and ZS_ALMOST_FULL (usage ration above 3/4) - that result in too many zspages being included in the ALMOST_EMPTY group for specific classes, size classes maintain a larger number of fullness lists that give strict guarantees on the minimum and maximum ->inuse values within each group. Each group represents a 10% change in the ->inuse ratio compared to neighboring groups. In essence, there are groups for zspages with 0%, 10%, 20% usage ratios, and so on, up to 100%. This enhances the selection of candidate zspages for both zs_malloc() and zs_compact(). A printout of the ->inuse counters of the first 7 zspages per (random) class fullness group: class-768 objs_per_zspage 16: fullness 100%: empty fullness 99%: empty fullness 90%: empty fullness 80%: empty fullness 70%: empty fullness 60%: 8 8 9 9 8 8 8 fullness 50%: empty fullness 40%: 5 5 6 5 5 5 5 fullness 30%: 4 4 4 4 4 4 4 fullness 20%: 2 3 2 3 3 2 2 fullness 10%: 1 1 1 1 1 1 1 fullness 0%: empty The zs_malloc() function searches through the groups of pages starting with the one having the highest usage ratio. This means that it always selects a zspage from the group with the least internal fragmentation (highest usage ratio) and makes it even less fragmented by increasing its usage ratio. The zs_compact() function, on the other hand, begins by scanning the group with the highest fragmentation (lowest usage ratio) to locate the source page. The first available zspage is selected, and then the function moves downward to find a destination zspage in the group with the lowest internal fragmentation (highest usage ratio). Link: https://lkml.kernel.org/r/20230304034835.2082479-3-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* zsmalloc: remove insert_zspage() ->inuse optimizationSergey Senozhatsky2023-03-291-12/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "zsmalloc: fine-grained fullness and new compaction algorithm", v4. Existing zsmalloc page fullness grouping leads to suboptimal page selection for both zs_malloc() and zs_compact(). This patchset reworks zsmalloc fullness grouping/classification. Additinally it also implements new compaction algorithm that is expected to use less CPU-cycles (as it potentially does fewer memcpy-s in zs_object_copy()). Test (synthetic) results can be seen in patch 0003. This patch (of 4): This optimization has no effect. It only ensures that when a zspage was added to its corresponding fullness list, its "inuse" counter was higher or lower than the "inuse" counter of the zspage at the head of the list. The intention was to keep busy zspages at the head, so they could be filled up and moved to the ZS_FULL fullness group more quickly. However, this doesn't work as the "inuse" counter of a zspage can be modified by obj_free() but the zspage may still belong to the same fullness list. So, fix_fullness_group() won't change the zspage's position in relation to the head's "inuse" counter, leading to a largely random order of zspages within the fullness list. For instance, consider a printout of the "inuse" counters of the first 10 zspages in a class that holds 93 objects per zspage: ZS_ALMOST_EMPTY: 36 67 68 64 35 54 63 52 As we can see the zspage with the lowest "inuse" counter is actually the head of the fullness list. Remove this pointless "optimisation". Link: https://lkml.kernel.org/r/20230304034835.2082479-1-senozhatsky@chromium.org Link: https://lkml.kernel.org/r/20230304034835.2082479-2-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* zsmalloc: make zspage chain size configurableSergey Senozhatsky2023-02-031-8/+4
| | | | | | | | | | | | | | | | Remove hard coded limit on the maximum number of physical pages per-zspage. This will allow tuning of zsmalloc pool as zspage chain size changes `pages per-zspage` and `objects per-zspage` characteristics of size classes which also affects size classes clustering (the way size classes are merged). Link: https://lkml.kernel.org/r/20230118005210.2814763-4-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* zsmalloc: skip chain size calculation for pow_of_2 classesSergey Senozhatsky2023-02-031-0/+3
| | | | | | | | | | | If a class size is power of 2 then it wastes no memory and the best configuration is 1 physical page per-zspage. Link: https://lkml.kernel.org/r/20230118005210.2814763-3-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* zsmalloc: rework zspage chain size selectionSergey Senozhatsky2023-02-031-37/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "zsmalloc: make zspage chain size configurable". Computers are bad at division. We currently decide the best zspage chain size (max number of physical pages per-zspage) by looking at a `used percentage` value. This is not enough as we lose precision during usage percentage calculations For example, let's look at size class 208: pages per zspage wasted bytes used% 1 144 96 2 80 99 3 16 99 4 160 99 Current algorithm will select 2 page per zspage configuration, as it's the first one to reach 99%. However, 3 pages per zspage waste less memory. Change algorithm and select zspage configuration that has lowest wasted value. Link: https://lkml.kernel.org/r/20230118005210.2814763-1-senozhatsky@chromium.org Link: https://lkml.kernel.org/r/20230118005210.2814763-2-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* Sync mm-stable with mm-hotfixes-stable to pick up dependent patchesAndrew Morton2023-02-011-32/+205
|\ | | | | | | Merge branch 'mm-hotfixes-stable' into mm-stable
| * zsmalloc: fix a race with deferred_handles storingNhat Pham2023-02-011-32/+205
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, there is a race between zs_free() and zs_reclaim_page(): zs_reclaim_page() finds a handle to an allocated object, but before the eviction happens, an independent zs_free() call to the same handle could come in and overwrite the object value stored at the handle with the last deferred handle. When zs_reclaim_page() finally gets to call the eviction handler, it will see an invalid object value (i.e the previous deferred handle instead of the original object value). This race happens quite infrequently. We only managed to produce it with out-of-tree developmental code that triggers zsmalloc writeback with a much higher frequency than usual. This patch fixes this race by storing the deferred handle in the object header instead. We differentiate the deferred handle from the other two cases (handle for allocated object, and linkage for free object) with a new tag. If zspage reclamation succeeds, we will free these deferred handles by walking through the zspage objects. On the other hand, if zspage reclamation fails, we reconstruct the zspage freelist (with the deferred handle tag and allocated tag) before trying again with the reclamation. [arnd@arndb.de: avoid unused-function warning] Link: https://lkml.kernel.org/r/20230117170507.2651972-1-arnd@kernel.org Link: https://lkml.kernel.org/r/20230110231701.326724-1-nphamcs@gmail.com Fixes: 9997bc017549 ("zsmalloc: implement writeback mechanism for zsmalloc") Signed-off-by: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | mm: remove PageMovable exportGreg Kroah-Hartman2023-01-191-3/+0
|/ | | | | | | | | | | | | | | | | The only in-kernel users that need PageMovable() to be exported are z3fold and zsmalloc and they are only using it for dubious debugging functionality. So remove those usages and the export so that no driver code accidentally thinks that they are allowed to use this symbol. Link: https://lkml.kernel.org/r/20230106135900.3763622-1-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* zsmalloc: implement writeback mechanism for zsmallocNhat Pham2022-12-121-11/+183
| | | | | | | | | | | | | | | | | | This commit adds the writeback mechanism for zsmalloc, analogous to the zbud allocator. Zsmalloc will attempt to determine the coldest zspage (i.e least recently used) in the pool, and attempt to write back all the stored compressed objects via the pool's evict handler. Link: https://lkml.kernel.org/r/20221128191616.1261026-7-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* zsmalloc: add zpool_ops field to zs_pool to store evict handlersNhat Pham2022-12-121-1/+10
| | | | | | | | | | | | | | | | This adds a new field to zs_pool to store evict handlers for writeback, analogous to the zbud allocator. Link: https://lkml.kernel.org/r/20221128191616.1261026-6-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* zsmalloc: add a LRU to zs_pool to keep track of zspages in LRU orderNhat Pham2022-12-121-0/+50
| | | | | | | | | | | | | | | This helps determines the coldest zspages as candidates for writeback. Link: https://lkml.kernel.org/r/20221128191616.1261026-5-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* zsmalloc: consolidate zs_pool's migrate_lock and size_class's locksNhat Pham2022-12-121-50/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, zsmalloc has a hierarchy of locks, which includes a pool-level migrate_lock, and a lock for each size class. We have to obtain both locks in the hotpath in most cases anyway, except for zs_malloc. This exception will no longer exist when we introduce a LRU into the zs_pool for the new writeback functionality - we will need to obtain a pool-level lock to synchronize LRU handling even in zs_malloc. In preparation for zsmalloc writeback, consolidate these locks into a single pool-level lock, which drastically reduces the complexity of synchronization in zsmalloc. We have also benchmarked the lock consolidation to see the performance effect of this change on zram. First, we ran a synthetic FS workload on a server machine with 36 cores (same machine for all runs), using fs_mark -d ../zram1mnt -s 100000 -n 2500 -t 32 -k before and after for btrfs and ext4 on zram (FS usage is 80%). Here is the result (unit is file/second): With lock consolidation (btrfs): Average: 13520.2, Median: 13531.0, Stddev: 137.5961482019028 Without lock consolidation (btrfs): Average: 13487.2, Median: 13575.0, Stddev: 309.08283679298665 With lock consolidation (ext4): Average: 16824.4, Median: 16839.0, Stddev: 89.97388510006668 Without lock consolidation (ext4) Average: 16958.0, Median: 16986.0, Stddev: 194.7370021336469 As you can see, we observe a 0.3% regression for btrfs, and a 0.9% regression for ext4. This is a small, barely measurable difference in my opinion. For a more realistic scenario, we also tries building the kernel on zram. Here is the time it takes (in seconds): With lock consolidation (btrfs): real Average: 319.6, Median: 320.0, Stddev: 0.8944271909999159 user Average: 6894.2, Median: 6895.0, Stddev: 25.528415540334656 sys Average: 521.4, Median: 522.0, Stddev: 1.51657508881031 Without lock consolidation (btrfs): real Average: 319.8, Median: 320.0, Stddev: 0.8366600265340756 user Average: 6896.6, Median: 6899.0, Stddev: 16.04057355583023 sys Average: 520.6, Median: 521.0, Stddev: 1.140175425099138 With lock consolidation (ext4): real Average: 320.0, Median: 319.0, Stddev: 1.4142135623730951 user Average: 6896.8, Median: 6878.0, Stddev: 28.621670111997307 sys Average: 521.2, Median: 521.0, Stddev: 1.7888543819998317 Without lock consolidation (ext4) real Average: 319.6, Median: 319.0, Stddev: 0.8944271909999159 user Average: 6886.2, Median: 6887.0, Stddev: 16.93221781102523 sys Average: 520.4, Median: 520.0, Stddev: 1.140175425099138 The difference is entirely within the noise of a typical run on zram. This hardly justifies the complexity of maintaining both the pool lock and the class lock. In fact, for writeback, we would need to introduce yet another lock to prevent data races on the pool's LRU, further complicating the lock handling logic. IMHO, it is just better to collapse all of these into a single pool-level lock. Link: https://lkml.kernel.org/r/20221128191616.1261026-4-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* zram: add size class equals check into recompressionAlexey Romanov2022-12-011-0/+21
| | | | | | | | | | | | | | | | | It makes no sense for us to recompress the object if it will be in the same size class. We anyway don't get any memory gain. But, at the same time, we get a CPU time overhead when inserting this object into zspage and decompressing it afterwards. [senozhatsky: rebased and fixed conflicts] Link: https://lkml.kernel.org/r/20221109115047.2921851-9-senozhatsky@chromium.org Signed-off-by: Alexey Romanov <avromanov@sberdevices.ru> Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Suleiman Souhlal <suleiman@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* zsmalloc: replace IS_ERR() with IS_ERR_VALUE()Deming Wang2022-12-011-1/+1
| | | | | | | | | Avoid typecasts that are needed for IS_ERR() and use IS_ERR_VALUE() instead. Link: https://lkml.kernel.org/r/20221104023818.1728-1-wangdeming@inspur.com Signed-off-by: Deming Wang <wangdeming@inspur.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* zsmalloc: zs_destroy_pool: add size_class NULL checkAlexey Romanov2022-10-211-0/+3
| | | | | | | | | | | | | | | Inside the zs_destroy_pool() function, there can still be NULL size_class pointers: if when the next size_class is allocated, inside zs_create_pool() function, kzalloc will return NULL and handling the error condition, zs_create_pool() will call zs_destroy_pool(). Link: https://lkml.kernel.org/r/20221013112825.61869-1-avromanov@sberdevices.ru Fixes: f24263a5a076 ("zsmalloc: remove unnecessary size_class NULL check") Signed-off-by: Alexey Romanov <avromanov@sberdevices.ru> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* zsmalloc: use correct types in _first_obj_offset functionsAlexey Romanov2022-10-031-4/+4
| | | | | | | | | | | | | | | | | Since commit ffedd09fa9b0 ("zsmalloc: Stop using slab fields in struct page") we are using page->page_type (unsigned int) field instead of page->units (int) as first object offset in a subpage of zspage. So get_first_obj_offset() and set_first_obj_offset() functions should work with unsigned int type. Link: https://lkml.kernel.org/r/20220909083722.85024-1-avromanov@sberdevices.ru Fixes: ffedd09fa9b0 ("zsmalloc: Stop using slab fields in struct page") Signed-off-by: Alexey Romanov <avromanov@sberdevices.ru> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Alexey Romanov <avromanov@sberdevices.ru> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* zsmalloc: zs_object_copy: replace email link to docAlexey Romanov2022-09-121-2/+2
| | | | | | | | | | | | | Emails are not documentation. [akpm@linux-foundation.org: fix whitespace damage, repair doc reference] [akpm@linux-foundation.org: coding-style cleanups] Link: https://lkml.kernel.org/r/20220815144825.39001-1-avromanov@sberdevices.ru Signed-off-by: Alexey Romanov <avromanov@sberdevices.ru> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* zsmalloc: remove unnecessary size_class NULL checkAlexey Romanov2022-09-121-7/+0
| | | | | | | | | | | | | | | | | | | | pool->size_class array elements can't be NULL, so this check is not needed. In the whole code, we assign pool->size_class[i] values that are not NULL. Releasing memory for these values occurs in the zs_destroy_pool() function, which also releases and destroys the pool. In addition, in the zs_stats_size_show() and async_free_zspage(), with similar iterations over the array, we don't check it for NULL pointer. Link: https://lkml.kernel.org/r/20220811153755.16102-3-avromanov@sberdevices.ru Signed-off-by: Alexey Romanov <avromanov@sberdevices.ru> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* zsmalloc: zs_object_copy: add clarifying commentAlexey Romanov2022-09-121-0/+7
| | | | | | | | | | | | | | | | | | | | | | | Patch series "tidy up zsmalloc implementation" This patchset remove some unnecessary checks and adds a clarifying comment. While analysing zs_object_copy() function code, I spent some time to understand what the call kunmap_atomic(d_addr) is for. It seems that this point is not trivial and it is worth adding a comment. This patch (of 2): It's not obvious why kunmap_atomic(d_addr) call is needed. [akpm@linux-foundation.org: tweak comment layout] Link: https://lkml.kernel.org/r/20220811153755.16102-1-avromanov@sberdevices.ru Link: https://lkml.kernel.org/r/20220811153755.16102-2-avromanov@sberdevices.ru Signed-off-by: Alexey Romanov <avromanov@sberdevices.ru> Cc: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm/zsmalloc: do not attempt to free IS_ERR handleSergey Senozhatsky2022-08-281-1/+1
| | | | | | | | | | | | | | | | | zsmalloc() now returns ERR_PTR values as handles, which zram accidentally can pass to zs_free(). Another bad scenario is when zcomp_compress() fails - handle has default -ENOMEM value, and zs_free() will try to free that "pointer value". Add the missing check and make sure that zs_free() bails out when ERR_PTR() is passed to it. Link: https://lkml.kernel.org/r/20220816050906.2583956-1-senozhatsky@chromium.org Fixes: c7e6f17b52e9 ("zsmalloc: zs_malloc: return ERR_PTR on failure") Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org>, Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* Merge tag 'mm-stable-2022-08-03' of ↵Linus Torvalds2022-08-061-6/+10
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Most of the MM queue. A few things are still pending. Liam's maple tree rework didn't make it. This has resulted in a few other minor patch series being held over for next time. Multi-gen LRU still isn't merged as we were waiting for mapletree to stabilize. The current plan is to merge MGLRU into -mm soon and to later reintroduce mapletree, with a view to hopefully getting both into 6.1-rc1. Summary: - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe Lin, Yang Shi, Anshuman Khandual and Mike Rapoport - Some kmemleak fixes from Patrick Wang and Waiman Long - DAMON updates from SeongJae Park - memcg debug/visibility work from Roman Gushchin - vmalloc speedup from Uladzislau Rezki - more folio conversion work from Matthew Wilcox - enhancements for coherent device memory mapping from Alex Sierra - addition of shared pages tracking and CoW support for fsdax, from Shiyang Ruan - hugetlb optimizations from Mike Kravetz - Mel Gorman has contributed some pagealloc changes to improve latency and realtime behaviour. - mprotect soft-dirty checking has been improved by Peter Xu - Many other singleton patches all over the place" [ XFS merge from hell as per Darrick Wong in https://lore.kernel.org/all/YshKnxb4VwXycPO8@magnolia/ ] * tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (282 commits) tools/testing/selftests/vm/hmm-tests.c: fix build mm: Kconfig: fix typo mm: memory-failure: convert to pr_fmt() mm: use is_zone_movable_page() helper hugetlbfs: fix inaccurate comment in hugetlbfs_statfs() hugetlbfs: cleanup some comments in inode.c hugetlbfs: remove unneeded header file hugetlbfs: remove unneeded hugetlbfs_ops forward declaration hugetlbfs: use helper macro SZ_1{K,M} mm: cleanup is_highmem() mm/hmm: add a test for cross device private faults selftests: add soft-dirty into run_vmtests.sh selftests: soft-dirty: add test for mprotect mm/mprotect: fix soft-dirty check in can_change_pte_writable() mm: memcontrol: fix potential oom_lock recursion deadlock mm/gup.c: fix formatting in check_and_migrate_movable_page() xfs: fail dax mount if reflink is enabled on a partition mm/memcontrol.c: remove the redundant updating of stats_flush_threshold userfaultfd: don't fail on unrecognized features hugetlb_cgroup: fix wrong hugetlb cgroup numa stat ...
| * zsmalloc: zs_malloc: return ERR_PTR on failureHui Zhu2022-07-301-5/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | zs_malloc returns 0 if it fails. zs_zpool_malloc will return -1 when zs_malloc return 0. But -1 makes the return value unclear. For example, when zswap_frontswap_store calls zs_malloc through zs_zpool_malloc, it will return -1 to its caller. The other return value is -EINVAL, -ENODEV or something else. This commit changes zs_malloc to return ERR_PTR on failure. It didn't just let zs_zpool_malloc return -ENOMEM becaue zs_malloc has two types of failure: - size is not OK return -EINVAL - memory alloc fail return -ENOMEM. Link: https://lkml.kernel.org/r/20220714080757.12161-1-teawater@gmail.com Signed-off-by: Hui Zhu <teawater@antgroup.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * mm: shrinkers: provide shrinkers with namesRoman Gushchin2022-07-041-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently shrinkers are anonymous objects. For debugging purposes they can be identified by count/scan function names, but it's not always useful: e.g. for superblock's shrinkers it's nice to have at least an idea of to which superblock the shrinker belongs. This commit adds names to shrinkers. register_shrinker() and prealloc_shrinker() functions are extended to take a format and arguments to master a name. In some cases it's not possible to determine a good name at the time when a shrinker is allocated. For such cases shrinker_debugfs_rename() is provided. The expected format is: <subsystem>-<shrinker_type>[:<instance>]-<id> For some shrinkers an instance can be encoded as (MAJOR:MINOR) pair. After this change the shrinker debugfs directory looks like: $ cd /sys/kernel/debug/shrinker/ $ ls dquota-cache-16 sb-devpts-28 sb-proc-47 sb-tmpfs-42 mm-shadow-18 sb-devtmpfs-5 sb-proc-48 sb-tmpfs-43 mm-zspool:zram0-34 sb-hugetlbfs-17 sb-pstore-31 sb-tmpfs-44 rcu-kfree-0 sb-hugetlbfs-33 sb-rootfs-2 sb-tmpfs-49 sb-aio-20 sb-iomem-12 sb-securityfs-6 sb-tracefs-13 sb-anon_inodefs-15 sb-mqueue-21 sb-selinuxfs-22 sb-xfs:vda1-36 sb-bdev-3 sb-nsfs-4 sb-sockfs-8 sb-zsmalloc-19 sb-bpf-32 sb-pipefs-14 sb-sysfs-26 thp-deferred_split-10 sb-btrfs:vda2-24 sb-proc-25 sb-tmpfs-1 thp-zero-9 sb-cgroup2-30 sb-proc-39 sb-tmpfs-27 xfs-buf:vda1-37 sb-configfs-23 sb-proc-41 sb-tmpfs-29 xfs-inodegc:vda1-38 sb-dax-11 sb-proc-45 sb-tmpfs-35 sb-debugfs-7 sb-proc-46 sb-tmpfs-40 [roman.gushchin@linux.dev: fix build warnings] Link: https://lkml.kernel.org/r/Yr+ZTnLb9lJk6fJO@castle Reported-by: kernel test robot <lkp@intel.com> Link: https://lkml.kernel.org/r/20220601032227.4076670-4-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Cc: Dave Chinner <dchinner@redhat.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | mm: Convert all PageMovable users to movable_operationsMatthew Wilcox (Oracle)2022-08-021-80/+22
|/ | | | | | | | | These drivers are rather uncomfortably hammered into the address_space_operations hole. They aren't filesystems and don't behave like filesystems. They just need their own movable_operations structure, which we can point to directly from page->mapping. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
* zsmalloc: fix races between asynchronous zspage free and page migrationSultan Alsawaf2022-05-141-4/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The asynchronous zspage free worker tries to lock a zspage's entire page list without defending against page migration. Since pages which haven't yet been locked can concurrently migrate off the zspage page list while lock_zspage() churns away, lock_zspage() can suffer from a few different lethal races. It can lock a page which no longer belongs to the zspage and unsafely dereference page_private(), it can unsafely dereference a torn pointer to the next page (since there's a data race), and it can observe a spurious NULL pointer to the next page and thus not lock all of the zspage's pages (since a single page migration will reconstruct the entire page list, and create_page_chain() unconditionally zeroes out each list pointer in the process). Fix the races by using migrate_read_lock() in lock_zspage() to synchronize with page migration. Link: https://lkml.kernel.org/r/20220509024703.243847-1-sultan@kerneltoast.com Fixes: 77ff465799c602 ("zsmalloc: zs_page_migrate: skip unnecessary loops but not return -EBUSY if zspage is not inuse") Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* zsmalloc: replace get_cpu_var with local_lockMike Galbraith2022-01-221-3/+8
| | | | | | | | | | | | | | | | | | | | | | | The usage of get_cpu_var() in zs_map_object() is problematic because it disables preemption and makes it impossible to acquire any sleeping lock on PREEMPT_RT such as a spinlock_t. Replace the get_cpu_var() usage with a local_lock_t which is embedded struct mapping_area. It ensures that the access the struct is synchronized against all users on the same CPU. [minchan: remove the bit_spin_lock part and change the title] Link: https://lkml.kernel.org/r/20211115185909.3949505-10-minchan@kernel.org Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Minchan Kim <minchan@kernel.org> Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* zsmalloc: replace per zpage lock with pool->migrate_lockMinchan Kim2022-01-221-109/+96
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The zsmalloc has used a bit for spin_lock in zpage handle to keep zpage object alive during several operations. However, it causes the problem for PREEMPT_RT as well as introducing too complicated. This patch replaces the bit spin_lock with pool->migrate_lock rwlock. It could make the code simple as well as zsmalloc work under PREEMPT_RT. The drawback is the pool->migrate_lock is bigger granuarity than per zpage lock so the contention would be higher than old when both IO-related operations(i.e., zsmalloc, zsfree, zs_[map|unmap]) and compaction(page/zpage migration) are going in parallel(*, the migrate_lock is rwlock and IO related functions are all read side lock so there is no contention). However, the write-side is fast enough(dominant overhead is just page copy) so it wouldn't affect much. If the lock granurity becomes more problem later, we could introduce table locks based on handle as a hash value. Link: https://lkml.kernel.org/r/20211115185909.3949505-9-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* zsmalloc: remove zspage isolation for migrationMinchan Kim2022-01-221-149/+8
| | | | | | | | | | | | | | | | | | | | | | | zspage isolation for migration introduced additional exceptions to be dealt with since the zspage was isolated from class list. The reason why I isolated zspage from class list was to prevent race between obj_malloc and page migration via allocating zpage from the zspage further. However, it couldn't prevent object freeing from zspage so it needed corner case handling. This patch removes the whole mess. Now, we are fine since class->lock and zspage->lock can prevent the race. Link: https://lkml.kernel.org/r/20211115185909.3949505-7-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* zsmalloc: move huge compressed obj from page to zspageMinchan Kim2022-01-221-24/+26
| | | | | | | | | | | | | | | The flag aims for zspage, not per page. Let's move it to zspage. Link: https://lkml.kernel.org/r/20211115185909.3949505-6-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* zsmalloc: introduce obj_allocatedMinchan Kim2022-01-221-17/+16
| | | | | | | | | | | | | | | | The usage pattern for obj_to_head is to check whether the zpage is allocated or not. Thus, introduce obj_allocated. Link: https://lkml.kernel.org/r/20211115185909.3949505-5-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* zsmalloc: decouple class actions from zspage worksMinchan Kim2022-01-221-10/+13
| | | | | | | | | | | | | | | | | This patch moves class stat update out of obj_malloc since it's not related to zspage operation. This is a preparation to introduce new lock scheme in next patch. Link: https://lkml.kernel.org/r/20211115185909.3949505-4-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* zsmalloc: rename zs_stat_type to class_stat_typeMinchan Kim2022-01-221-12/+12
| | | | | | | | | | | | | | | The stat aims for class stat, not zspage so rename it. Link: https://lkml.kernel.org/r/20211115185909.3949505-3-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* zsmalloc: introduce some helper functionsMinchan Kim2022-01-221-31/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "zsmalloc: remove bit_spin_lock", v2. zsmalloc uses bit_spin_lock to minimize space overhead since it's zpage granularity lock. However, it causes zsmalloc non-working under PREEMPT_RT as well as adding too much complication. This patchset tries to replace the bit_spin_lock with per-pool rwlock. It also removes unnecessary zspage isolation logic from class, which was the other part too much complication added into zsmalloc. Last patch changes the get_cpu_var to local_lock to make it work in PREEMPT_RT. This patch (of 9): get_zspage_mapping returns fullness as well as class_idx. However, the fullness is usually not used since it could be stale in some contexts. It causes misleading as well as unnecessary instructions so this patch introduces zspage_class. obj_to_location also produces page and index but we don't need always the index, either so this patch introduces obj_to_page. Link: https://lkml.kernel.org/r/20211115185909.3949505-1-minchan@kernel.org Link: https://lkml.kernel.org/r/20211115185909.3949505-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* zsmalloc: Stop using slab fields in struct pageMatthew Wilcox (Oracle)2022-01-061-9/+9
| | | | | | | | | | | | | | | | The ->freelist and ->units members of struct page are for the use of slab only. I'm not particularly familiar with zsmalloc, so generate the same code by using page->index to store 'page' (page->index and page->freelist are at the same offset in struct page). This should be cleaned up properly at some point by somebody who is familiar with zsmalloc. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
* mm/zsmalloc.c: close race window between zs_pool_dec_isolated() and ↵Miaohe Lin2021-11-061-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | zs_unregister_migration() There is one possible race window between zs_pool_dec_isolated() and zs_unregister_migration() because wait_for_isolated_drain() checks the isolated count without holding class->lock and there is no order inside zs_pool_dec_isolated(). Thus the below race window could be possible: zs_pool_dec_isolated zs_unregister_migration check pool->destroying != 0 pool->destroying = true; smp_mb(); wait_for_isolated_drain() wait for pool->isolated_pages == 0 atomic_long_dec(&pool->isolated_pages); atomic_long_read(&pool->isolated_pages) == 0 Since we observe the pool->destroying (false) before atomic_long_dec() for pool->isolated_pages, waking pool->migration_wait up is missed. Fix this by ensure checking pool->destroying happens after the atomic_long_dec(&pool->isolated_pages). Link: https://lkml.kernel.org/r/20210708115027.7557-1-linmiaohe@huawei.com Fixes: 701d678599d0 ("mm/zsmalloc.c: fix race condition in zs_destroy_pool") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Henry Burns <henryburns@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/zsmalloc.c: improve readability for async_free_zspage()Miaohe Lin2021-07-011-1/+1
| | | | | | | | | | | | | | The class is extracted from pool->size_class[class_idx] again before calling __free_zspage(). It looks like class will change after we fetch the class lock. But this is misleading as class will stay unchanged. Link: https://lkml.kernel.org/r/20210624123930.1769093-4-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/zsmalloc.c: remove confusing code in obj_free()Miaohe Lin2021-07-011-1/+0
| | | | | | | | | | | | | | | | | | | | | | | Patch series "Cleanup for zsmalloc". This series contains cleanups to remove confusing code in obj_free(), combine two atomic ops and improve readability for async_free_zspage(). More details can be found in the respective changelogs. This patch (of 2): OBJ_ALLOCATED_TAG is only set for handle to indicate allocated object. It's irrelevant with obj. So remove this misleading code to improve readability. Link: https://lkml.kernel.org/r/20210624123930.1769093-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20210624123930.1769093-2-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: fix typos in commentsIngo Molnar2021-05-071-1/+1
| | | | | | | | | | | | | | Fix ~94 single-word typos in locking code comments, plus a few very obvious grammar mistakes. Link: https://lkml.kernel.org/r/20210322212624.GA1963421@gmail.com Link: https://lore.kernel.org/r/20210322205203.GB1959563@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Cc: Bhaskar Chowdhury <unixbhaskar@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: fix some typos and code style problemsShijie Luo2021-05-071-2/+2
| | | | | | | | | | | | | | | | | | fix some typos and code style problems in mm. gfp.h: s/MAXNODES/MAX_NUMNODES mmzone.h: s/then/than rmap.c: s/__vma_split()/__vma_adjust() swap.c: s/__mod_zone_page_stat/__mod_zone_page_state, s/is is/is swap_state.c: s/whoes/whose z3fold.c: code style problem fix in z3fold_unregister_migration zsmalloc.c: s/of/or, s/give/given Link: https://lkml.kernel.org/r/20210419083057.64820-1-luoshijie1@huawei.com Signed-off-by: Shijie Luo <luoshijie1@huawei.com> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/zsmalloc: use BUG_ON instead of if condition followed by BUG.zhouchuangao2021-05-051-4/+2
| | | | | | | | | | | It can be optimized at compile time. Link: https://lkml.kernel.org/r/1616727798-9110-1-git-send-email-zhouchuangao@vivo.com Signed-off-by: zhouchuangao <zhouchuangao@vivo.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/zsmalloc.c: use page_private() to access page->privateMiaohe Lin2021-02-261-1/+1
| | | | | | | | | | | | It's recommended to use helper macro page_private() to access the private field of page. Use such helper to eliminate direct access. Link: https://lkml.kernel.org/r/20210203091857.20017-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* zsmalloc: account the number of compacted pages correctlyRokudo Yan2021-02-261-6/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | There exists multiple path may do zram compaction concurrently. 1. auto-compaction triggered during memory reclaim 2. userspace utils write zram<id>/compaction node So, multiple threads may call zs_shrinker_scan/zs_compact concurrently. But pages_compacted is a per zsmalloc pool variable and modification of the variable is not serialized(through under class->lock). There are two issues here: 1. the pages_compacted may not equal to total number of pages freed(due to concurrently add). 2. zs_shrinker_scan may not return the correct number of pages freed(issued by current shrinker). The fix is simple: 1. account the number of pages freed in zs_compact locally. 2. use actomic variable pages_compacted to accumulate total number. Link: https://lkml.kernel.org/r/20210202122235.26885-1-wu-yan@tcl.com Fixes: 860c707dca155a56 ("zsmalloc: account the number of compacted pages") Signed-off-by: Rokudo Yan <wu-yan@tcl.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>