| Commit message (Collapse) | Author | Age | Files | Lines |
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"Along with the usual shower of singleton patches, notable patch series
in this pull request are:
- "Align kvrealloc() with krealloc()" from Danilo Krummrich. Adds
consistency to the APIs and behaviour of these two core allocation
functions. This also simplifies/enables Rustification.
- "Some cleanups for shmem" from Baolin Wang. No functional changes -
mode code reuse, better function naming, logic simplifications.
- "mm: some small page fault cleanups" from Josef Bacik. No
functional changes - code cleanups only.
- "Various memory tiering fixes" from Zi Yan. A small fix and a
little cleanup.
- "mm/swap: remove boilerplate" from Yu Zhao. Code cleanups and
simplifications and .text shrinkage.
- "Kernel stack usage histogram" from Pasha Tatashin and Shakeel
Butt. This is a feature, it adds new feilds to /proc/vmstat such as
$ grep kstack /proc/vmstat
kstack_1k 3
kstack_2k 188
kstack_4k 11391
kstack_8k 243
kstack_16k 0
which tells us that 11391 processes used 4k of stack while none at
all used 16k. Useful for some system tuning things, but
partivularly useful for "the dynamic kernel stack project".
- "kmemleak: support for percpu memory leak detect" from Pavel
Tikhomirov. Teaches kmemleak to detect leaksage of percpu memory.
- "mm: memcg: page counters optimizations" from Roman Gushchin. "3
independent small optimizations of page counters".
- "mm: split PTE/PMD PT table Kconfig cleanups+clarifications" from
David Hildenbrand. Improves PTE/PMD splitlock detection, makes
powerpc/8xx work correctly by design rather than by accident.
- "mm: remove arch_make_page_accessible()" from David Hildenbrand.
Some folio conversions which make arch_make_page_accessible()
unneeded.
- "mm, memcg: cg2 memory{.swap,}.peak write handlers" fro David
Finkel. Cleans up and fixes our handling of the resetting of the
cgroup/process peak-memory-use detector.
- "Make core VMA operations internal and testable" from Lorenzo
Stoakes. Rationalizaion and encapsulation of the VMA manipulation
APIs. With a view to better enable testing of the VMA functions,
even from a userspace-only harness.
- "mm: zswap: fixes for global shrinker" from Takero Funaki. Fix
issues in the zswap global shrinker, resulting in improved
performance.
- "mm: print the promo watermark in zoneinfo" from Kaiyang Zhao. Fill
in some missing info in /proc/zoneinfo.
- "mm: replace follow_page() by folio_walk" from David Hildenbrand.
Code cleanups and rationalizations (conversion to folio_walk())
resulting in the removal of follow_page().
- "improving dynamic zswap shrinker protection scheme" from Nhat
Pham. Some tuning to improve zswap's dynamic shrinker. Significant
reductions in swapin and improvements in performance are shown.
- "mm: Fix several issues with unaccepted memory" from Kirill
Shutemov. Improvements to the new unaccepted memory feature,
- "mm/mprotect: Fix dax puds" from Peter Xu. Implements mprotect on
DAX PUDs. This was missing, although nobody seems to have notied
yet.
- "Introduce a store type enum for the Maple tree" from Sidhartha
Kumar. Cleanups and modest performance improvements for the maple
tree library code.
- "memcg: further decouple v1 code from v2" from Shakeel Butt. Move
more cgroup v1 remnants away from the v2 memcg code.
- "memcg: initiate deprecation of v1 features" from Shakeel Butt.
Adds various warnings telling users that memcg v1 features are
deprecated.
- "mm: swap: mTHP swap allocator base on swap cluster order" from
Chris Li. Greatly improves the success rate of the mTHP swap
allocation.
- "mm: introduce numa_memblks" from Mike Rapoport. Moves various
disparate per-arch implementations of numa_memblk code into generic
code.
- "mm: batch free swaps for zap_pte_range()" from Barry Song. Greatly
improves the performance of munmap() of swap-filled ptes.
- "support large folio swap-out and swap-in for shmem" from Baolin
Wang. With this series we no longer split shmem large folios into
simgle-page folios when swapping out shmem.
- "mm/hugetlb: alloc/free gigantic folios" from Yu Zhao. Nice
performance improvements and code reductions for gigantic folios.
- "support shmem mTHP collapse" from Baolin Wang. Adds support for
khugepaged's collapsing of shmem mTHP folios.
- "mm: Optimize mseal checks" from Pedro Falcato. Fixes an mprotect()
performance regression due to the addition of mseal().
- "Increase the number of bits available in page_type" from Matthew
Wilcox. Increases the number of bits available in page_type!
- "Simplify the page flags a little" from Matthew Wilcox. Many legacy
page flags are now folio flags, so the page-based flags and their
accessors/mutators can be removed.
- "mm: store zero pages to be swapped out in a bitmap" from Usama
Arif. An optimization which permits us to avoid writing/reading
zero-filled zswap pages to backing store.
- "Avoid MAP_FIXED gap exposure" from Liam Howlett. Fixes a race
window which occurs when a MAP_FIXED operqtion is occurring during
an unrelated vma tree walk.
- "mm: remove vma_merge()" from Lorenzo Stoakes. Major rotorooting of
the vma_merge() functionality, making ot cleaner, more testable and
better tested.
- "misc fixups for DAMON {self,kunit} tests" from SeongJae Park.
Minor fixups of DAMON selftests and kunit tests.
- "mm: memory_hotplug: improve do_migrate_range()" from Kefeng Wang.
Code cleanups and folio conversions.
- "Shmem mTHP controls and stats improvements" from Ryan Roberts.
Cleanups for shmem controls and stats.
- "mm: count the number of anonymous THPs per size" from Barry Song.
Expose additional anon THP stats to userspace for improved tuning.
- "mm: finish isolate/putback_lru_page()" from Kefeng Wang: more
folio conversions and removal of now-unused page-based APIs.
- "replace per-quota region priorities histogram buffer with
per-context one" from SeongJae Park. DAMON histogram
rationalization.
- "Docs/damon: update GitHub repo URLs and maintainer-profile" from
SeongJae Park. DAMON documentation updates.
- "mm/vdpa: correct misuse of non-direct-reclaim __GFP_NOFAIL and
improve related doc and warn" from Jason Wang: fixes usage of page
allocator __GFP_NOFAIL and GFP_ATOMIC flags.
- "mm: split underused THPs" from Yu Zhao. Improve THP=always policy.
This was overprovisioning THPs in sparsely accessed memory areas.
- "zram: introduce custom comp backends API" frm Sergey Senozhatsky.
Add support for zram run-time compression algorithm tuning.
- "mm: Care about shadow stack guard gap when getting an unmapped
area" from Mark Brown. Fix up the various arch_get_unmapped_area()
implementations to better respect guard areas.
- "Improve mem_cgroup_iter()" from Kinsey Ho. Improve the reliability
of mem_cgroup_iter() and various code cleanups.
- "mm: Support huge pfnmaps" from Peter Xu. Extends the usage of huge
pfnmap support.
- "resource: Fix region_intersects() vs add_memory_driver_managed()"
from Huang Ying. Fix a bug in region_intersects() for systems with
CXL memory.
- "mm: hwpoison: two more poison recovery" from Kefeng Wang. Teaches
a couple more code paths to correctly recover from the encountering
of poisoned memry.
- "mm: enable large folios swap-in support" from Barry Song. Support
the swapin of mTHP memory into appropriately-sized folios, rather
than into single-page folios"
* tag 'mm-stable-2024-09-20-02-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (416 commits)
zram: free secondary algorithms names
uprobes: turn xol_area->pages[2] into xol_area->page
uprobes: introduce the global struct vm_special_mapping xol_mapping
Revert "uprobes: use vm_special_mapping close() functionality"
mm: support large folios swap-in for sync io devices
mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios
mm: fix swap_read_folio_zeromap() for large folios with partial zeromap
mm/debug_vm_pgtable: Use pxdp_get() for accessing page table entries
set_memory: add __must_check to generic stubs
mm/vma: return the exact errno in vms_gather_munmap_vmas()
memcg: cleanup with !CONFIG_MEMCG_V1
mm/show_mem.c: report alloc tags in human readable units
mm: support poison recovery from copy_present_page()
mm: support poison recovery from do_cow_fault()
resource, kunit: add test case for region_intersects()
resource: make alloc_free_mem_region() works for iomem_resource
mm: z3fold: deprecate CONFIG_Z3FOLD
vfio/pci: implement huge_fault support
mm/arm64: support large pfn mappings
mm/x86: support large pfn mappings
...
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
We will store a per-algorithm parameters there (compression level,
dictionary, dictionary size, etc.).
Link: https://lkml.kernel.org/r/20240902105656.1383858-14-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The zram_table_entry::flags member is of type long and uses 8 bytes on a
64bit architecture. With a PAGE_SIZE of 256KiB we have PAGE_SHIFT of 18
which in turn leads to __NR_ZRAM_PAGEFLAGS = 27. This still fits in an
ordinary integer.
By reducing the size of `flags' to four bytes, the size of the struct
goes back to 16 bytes. The padding between the lock and ac_time (if
enabled) is also gone.
Make zram_table_entry::flags an unsigned int and update the build test
to reflect the change.
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20240906141520.730009-4-bigeasy@linutronix.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The ZRAM_LOCK was used for locking and after the addition of spinlock_t
the bit set and cleared but there no reader of it.
Remove the ZRAM_LOCK bit.
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20240906141520.730009-3-bigeasy@linutronix.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|/
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The bit spinlock disables preemption. The spinlock_t lock becomes a sleeping
lock on PREEMPT_RT and it can not be acquired in this context. In this locked
section, zs_free() acquires a zs_pool::lock, and there is access to
zram::wb_limit_lock.
Add a spinlock_t for locking. Keep the set/ clear ZRAM_LOCK bit after
the lock has been acquired/ dropped. The size of struct zram_table_entry
increases by 4 bytes due to lock and additional 4 bytes padding with
CONFIG_ZRAM_TRACK_ENTRY_ACTIME enabled.
Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20240906141520.730009-2-bigeasy@linutronix.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
|
|
| |
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
|
|
|
|
| |
Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-12-adbd023e19cc@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
ZRAM_MEMORY_TRACKING enables two features:
- per-entry ac-time tracking
- debugfs interface
The latter one is the reason why memory-tracking depends on DEBUG_FS,
while the former one is used far beyond debugging these days. Namely
ac-time is used for fine grained writeback of idle entries (pages).
Move ac-time tracking under its own config option so that it can be
enabled (along with writeback) on systems without DEBUG_FS.
[senozhatsky@chromium.org: ifdef fixup, per Dmytro]
Link: https://lkml.kernel.org/r/20231117013543.540280-1-senozhatsky@chromium.org
Link: https://lkml.kernel.org/r/20231115024223.4133148-1-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dmytro Maluka <dmaluka@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Convert zram to use bdev_open_by_dev() and pass the handle around.
CC: Minchan Kim <minchan@kernel.org>
CC: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230927093442.25915-8-jack@suse.cz
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
All bios hande to drivers from the block layer are checked against the
device size and for logical block alignment already (and have been since
long before zram was merged), so don't duplicate those checks.
Link: https://lkml.kernel.org/r/20230411171459.567614-3-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We don't show num_reads and num_writes since we removed corresponding
sysfs nodes in 2017. Block layer stats are exposed via
/sys/block/zramX/stat file.
However, we still increment those atomic vars and store them in zram
stats. Remove leftovers.
Link: https://lkml.kernel.org/r/20221117141326.1105181-1-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Recompression iterates through all the registered secondary compression
algorithms in order of their priorities so that we have higher chances of
finding the algorithm that compresses a particular page. This, however,
may not always be best approach and sometimes we may want to limit
recompression to only one particular algorithm. For instance, when a
higher priority algorithm uses too much power and device has a relatively
low battery level we may want to limit recompression to use only a lower
priority algorithm, which uses less power.
Introduce algo= parameter support to recompression sysfs knob so that
user-sapce can request recompression with particular algorithm only:
echo "type=idle algo=zstd" > /sys/block/zramX/recompress
Link: https://lkml.kernel.org/r/20221109115047.2921851-11-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Alexey Romanov <avromanov@sberdevices.ru>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Allow zram to recompress (using secondary compression streams)
pages.
Re-compression algorithms (we support up to 3 at this stage)
are selected via recomp_algorithm:
echo "algo=zstd priority=1" > /sys/block/zramX/recomp_algorithm
Please read documentation for more details.
We support several recompression modes:
1) IDLE pages recompression is activated by `idle` mode
echo "type=idle" > /sys/block/zram0/recompress
2) Since there may be many idle pages user-space may pass a size
threshold value (in bytes) and we will recompress pages only
of equal or greater size:
echo "threshold=888" > /sys/block/zram0/recompress
3) HUGE pages recompression is activated by `huge` mode
echo "type=huge" > /sys/block/zram0/recompress
4) HUGE_IDLE pages recompression is activated by `huge_idle` mode
echo "type=huge_idle" > /sys/block/zram0/recompress
[senozhatsky@chromium.org: we should always zero out err variable in recompress loop[
Link: https://lkml.kernel.org/r/20221110143423.3250790-1-senozhatsky@chromium.org
Link: https://lkml.kernel.org/r/20221109115047.2921851-5-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Alexey Romanov <avromanov@sberdevices.ru>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Patch series "zram: Support multiple compression streams", v5.
This series adds support for multiple compression streams. The main idea
is that different compression algorithms have different characteristics
and zram may benefit when it uses a combination of algorithms: a default
algorithm that is faster but have lower compression rate and a secondary
algorithm that can use higher compression rate at a price of slower
compression/decompression.
There are several use-case for this functionality:
- huge pages re-compression: zstd or deflate can successfully compress
huge pages (~50% of huge pages on my synthetic ChromeOS tests), IOW
pages that lzo was not able to compress.
- idle pages re-compression: idle/cold pages sit in the memory and we
may reduce zsmalloc memory usage if we recompress those idle pages.
Userspace has a number of ways to control the behavior and impact of zram
recompression: what type of pages should be recompressed, size watermarks,
etc. Please refer to documentation patch.
This patch (of 13):
The patch turns compression streams and compressor algorithm name struct
zram members into arrays, so that we can have multiple compression streams
support (in the next patches).
The patch uses a rather explicit API for compressor selection:
- Get primary (default) compression stream
zcomp_stream_get(zram->comps[ZRAM_PRIMARY_COMP])
- Get secondary compression stream
zcomp_stream_get(zram->comps[ZRAM_SECONDARY_COMP])
We use similar API for compression streams put().
At this point we always have just one compression stream,
since CONFIG_ZRAM_MULTI_COMP is not yet defined.
Link: https://lkml.kernel.org/r/20221109115047.2921851-1-senozhatsky@chromium.org
Link: https://lkml.kernel.org/r/20221109115047.2921851-2-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Alexey Romanov <avromanov@sberdevices.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
zram_table_entry::flags stores object size in the lower bits and zram
pageflags in the upper bits. However, for some reason, we use 24 lower
bits, while maximum zram object size is PAGE_SIZE, which requires
PAGE_SHIFT bits (up to 16 on arm64). This wastes 24 - PAGE_SHIFT bits
that we can use for additional zram pageflags instead.
Also add a BUILD_BUG_ON() to alert us should we run out of bits in
zram_table_entry::flags.
Link: https://lkml.kernel.org/r/20220912152744.527438-1-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Brian Geffon <bgeffon@google.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This reverts commit e7be8d1dd983156b ("zram: remove double compression
logic") as it causes zram failures. It does not revert cleanly, PTR_ERR
handling was introduced in the meantime. This is handled by appropriate
IS_ERR.
When under memory pressure, zs_malloc() can fail. Before the above
commit, the allocation was retried with direct reclaim enabled (GFP_NOIO).
After the commit, it is not -- only __GFP_KSWAPD_RECLAIM is tried.
So when the failure occurs under memory pressure, the overlaying
filesystem such as ext2 (mounted by ext4 module in this case) can emit
failures, making the (file)system unusable:
EXT4-fs warning (device zram0): ext4_end_bio:343: I/O error 10 writing to inode 16386 starting block 159744)
Buffer I/O error on device zram0, logical block 159744
With direct reclaim, memory is really reclaimed and allocation succeeds,
eventually. In the worst case, the oom killer is invoked, which is proper
outcome if user sets up zram too large (in comparison to available RAM).
This very diff doesn't apply to 5.19 (stable) cleanly (see PTR_ERR note
above). Use revert of e7be8d1dd983 directly.
Link: https://bugzilla.suse.com/show_bug.cgi?id=1202203
Link: https://lkml.kernel.org/r/20220810070609.14402-1-jslaby@suse.cz
Fixes: e7be8d1dd983 ("zram: remove double compression logic")
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Alexey Romanov <avromanov@sberdevices.ru>
Cc: Dmitry Rokosov <ddrokosov@sberdevices.ru>
Cc: Lukas Czerner <lczerner@redhat.com>
Cc: <stable@vger.kernel.org> [5.19]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The 2nd trial allocation under per-cpu presumption has been used to
prevent regression of allocation failure. However, it makes trouble for
maintenance without significant benefit. The slowpath branch is executed
extremely rarely: getting there is problematic. Therefore, we delete this
branch.
Since b09ab054b69b ("zram: support BDI_CAP_STABLE_WRITES"), zram has used
QUEUE_FLAG_STABLE_WRITES to prevent buffer change between 1st and 2nd
memory allocations. Since we remove second trial memory allocation logic,
we could remove the STABLE_WRITES flag because there is no change buffer
to be modified under us.
Link: https://lkml.kernel.org/r/20220505094443.11728-1-avromanov@sberdevices.ru
Signed-off-by: Alexey Romanov <avromanov@sberdevices.ru>
Signed-off-by: Dmitry Rokosov <ddrokosov@sberdevices.ru>
Acked-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Merge more updates from Andrew Morton:
"190 patches.
Subsystems affected by this patch series: mm (hugetlb, userfaultfd,
vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock,
migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap,
zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc,
core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs,
signals, exec, kcov, selftests, compress/decompress, and ipc"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (190 commits)
ipc/util.c: use binary search for max_idx
ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock
ipc: use kmalloc for msg_queue and shmid_kernel
ipc sem: use kvmalloc for sem_undo allocation
lib/decompressors: remove set but not used variabled 'level'
selftests/vm/pkeys: exercise x86 XSAVE init state
selftests/vm/pkeys: refill shadow register after implicit kernel write
selftests/vm/pkeys: handle negative sys_pkey_alloc() return code
selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random
kcov: add __no_sanitize_coverage to fix noinstr for all architectures
exec: remove checks in __register_bimfmt()
x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned
hfsplus: report create_date to kstat.btime
hfsplus: remove unnecessary oom message
nilfs2: remove redundant continue statement in a while-loop
kprobes: remove duplicated strong free_insn_page in x86 and s390
init: print out unknown kernel parameters
checkpatch: do not complain about positive return values starting with EPOLL
checkpatch: improve the indented label test
checkpatch: scripts/spdxcheck.py now requires python3
...
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
backing_dev is never used when not enable CONFIG_ZRAM_WRITEBACK and it's
introduced from writeback feature. So it's needless also affect
readability in that case.
Link: https://lkml.kernel.org/r/20210521060544.2385-1-zbestahu@gmail.com
Signed-off-by: Yue Hu <huyue2@yulong.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|/
|
|
|
|
|
|
|
|
|
| |
Replace the per-block device bd_mutex with a per-gendisk open_mutex,
thus simplifying locking wherever we deal with partitions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Link: https://lore.kernel.org/r/20210525061301.2242282-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Pull block updates from Jens Axboe:
"Another series of killing more code than what is being added, again
thanks to Christoph's relentless cleanups and tech debt tackling.
This contains:
- blk-iocost improvements (Baolin Wang)
- part0 iostat fix (Jeffle Xu)
- Disable iopoll for split bios (Jeffle Xu)
- block tracepoint cleanups (Christoph Hellwig)
- Merging of struct block_device and hd_struct (Christoph Hellwig)
- Rework/cleanup of how block device sizes are updated (Christoph
Hellwig)
- Simplification of gendisk lookup and removal of block device
aliasing (Christoph Hellwig)
- Block device ioctl cleanups (Christoph Hellwig)
- Removal of bdget()/blkdev_get() as exported API (Christoph Hellwig)
- Disk change rework, avoid ->revalidate_disk() (Christoph Hellwig)
- sbitmap improvements (Pavel Begunkov)
- Hybrid polling fix (Pavel Begunkov)
- bvec iteration improvements (Pavel Begunkov)
- Zone revalidation fixes (Damien Le Moal)
- blk-throttle limit fix (Yu Kuai)
- Various little fixes"
* tag 'for-5.11/block-2020-12-14' of git://git.kernel.dk/linux-block: (126 commits)
blk-mq: fix msec comment from micro to milli seconds
blk-mq: update arg in comment of blk_mq_map_queue
blk-mq: add helper allocating tagset->tags
Revert "block: Fix a lockdep complaint triggered by request queue flushing"
nvme-loop: use blk_mq_hctx_set_fq_lock_class to set loop's lock class
blk-mq: add new API of blk_mq_hctx_set_fq_lock_class
block: disable iopoll for split bio
block: Improve blk_revalidate_disk_zones() checks
sbitmap: simplify wrap check
sbitmap: replace CAS with atomic and
sbitmap: remove swap_lock
sbitmap: optimise sbitmap_deferred_clear()
blk-mq: skip hybrid polling if iopoll doesn't spin
blk-iocost: Factor out the base vrate change into a separate function
blk-iocost: Factor out the active iocgs' state check into a separate function
blk-iocost: Move the usage ratio calculation to the correct place
blk-iocost: Remove unnecessary advance declaration
blk-iocost: Fix some typos in comments
blktrace: fix up a kerneldoc comment
block: remove the request_queue to argument request based tracepoints
...
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
set_blocksize is used by file systems to use their preferred buffer cache
block size. Block drivers should not set it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|/
|
|
|
|
|
|
|
|
|
|
|
| |
Currently, zram supports the stat via /sys/block/zram/mm_stat to represent
how many of incompressible pages are stored at the moment but it couldn't
show how many times incompressible pages were wrote down since zram set
up. It's also good indication to see how zram is effective in the system.
Link: https://lkml.kernel.org/r/20201130201907.1284910-1-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch includes some fixes and cleanup for idle-page writeback.
1. writeback_limit interface
Now writeback_limit interface is rather conusing. For example, once
writeback limit budget is exausted, admin can see 0 from
/sys/block/zramX/writeback_limit which is same semantic with disable
writeback_limit at this moment. IOW, admin cannot tell that zero came
from disable writeback limit or exausted writeback limit.
To make the interface clear, let's sepatate enable of writeback limit to
another knob - /sys/block/zram0/writeback_limit_enable
* before:
while true :
# to re-enable writeback limit once previous one is used up
echo 0 > /sys/block/zram0/writeback_limit
echo $((200<<20)) > /sys/block/zram0/writeback_limit
..
.. # used up the writeback limit budget
* new
# To enable writeback limit, from the beginning, admin should
# enable it.
echo $((200<<20)) > /sys/block/zram0/writeback_limit
echo 1 > /sys/block/zram/0/writeback_limit_enable
while true :
echo $((200<<20)) > /sys/block/zram0/writeback_limit
..
.. # used up the writeback limit budget
It's much strightforward.
2. fix condition check idle/huge writeback mode check
The mode in writeback_store is not bit opeartion any more so no need to
use bit operations. Furthermore, current condition check is broken in
that it does writeback every pages regardless of huge/idle.
3. clean up idle_store
No need to use goto.
[minchan@kernel.org: missed spin_lock_init]
Link: http://lkml.kernel.org/r/20190103001601.GA255139@google.com
Link: http://lkml.kernel.org/r/20181224033529.19450-1-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Suggested-by: John Dias <joaodias@google.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: John Dias <joaodias@google.com>
Cc: Srinivas Paladugu <srnvs@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If there are lots of write IO with flash device, it could have a
wearout problem of storage. To overcome the problem, admin needs
to design write limitation to guarantee flash health
for entire product life.
This patch creates a new knob "writeback_limit" for zram.
writeback_limit's default value is 0 so that it doesn't limit
any writeback. If admin want to measure writeback count in a
certain period, he could know it via /sys/block/zram0/bd_stat's
3rd column.
If admin want to limit writeback as per-day 400M, he could do it
like below.
MB_SHIFT=20
4K_SHIFT=12
echo $((400<<MB_SHIFT>>4K_SHIFT)) > \
/sys/block/zram0/writeback_limit.
If admin want to allow further write again, he could do it like below
echo 0 > /sys/block/zram0/writeback_limit
If admin want to see remaining writeback budget,
cat /sys/block/zram0/writeback_limit
The writeback_limit count will reset whenever you reset zram (e.g., system
reboot, echo 1 > /sys/block/zramX/reset) so keeping how many of writeback
happened until you reset the zram to allocate extra writeback budget in
next setting is user's job.
[minchan@kernel.org: v4]
Link: http://lkml.kernel.org/r/20181203024045.153534-8-minchan@kernel.org
Link: http://lkml.kernel.org/r/20181127055429.251614-8-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Joey Pabalinas <joeypabalinas@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
bd_stat represents things that happened in the backing device. Currently
it supports bd_counts, bd_reads and bd_writes which are helpful to
understand wearout of flash and memory saving.
[minchan@kernel.org: v4]
Link: http://lkml.kernel.org/r/20181203024045.153534-7-minchan@kernel.org
Link: http://lkml.kernel.org/r/20181127055429.251614-7-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Joey Pabalinas <joeypabalinas@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add a new feature "zram idle/huge page writeback". In the zram-swap use
case, zram usually has many idle/huge swap pages. It's pointless to keep
them in memory (ie, zram).
To solve this problem, this feature introduces idle/huge page writeback to
the backing device so the goal is to save more memory space on embedded
systems.
Normal sequence to use idle/huge page writeback feature is as follows,
while (1) {
# mark allocated zram slot to idle
echo all > /sys/block/zram0/idle
# leave system working for several hours
# Unless there is no access for some blocks on zram,
# they are still IDLE marked pages.
echo "idle" > /sys/block/zram0/writeback
or/and
echo "huge" > /sys/block/zram0/writeback
# write the IDLE or/and huge marked slot into backing device
# and free the memory.
}
Per the discussion at
https://lore.kernel.org/lkml/20181122065926.GG3441@jagdpanzerIV/T/#u,
This patch removes direct incommpressibe page writeback feature
(d2afd25114f4 ("zram: write incompressible pages to backing device")).
Below concerns from Sergey:
== &< ==
"IDLE writeback" is superior to "incompressible writeback".
"incompressible writeback" is completely unpredictable and uncontrollable;
it depens on data patterns and compression algorithms. While "IDLE
writeback" is predictable.
I even suspect, that, *ideally*, we can remove "incompressible writeback".
"IDLE pages" is a super set which also includes "incompressible" pages.
So, technically, we still can do "incompressible writeback" from "IDLE
writeback" path; but a much more reasonable one, based on a page idling
period.
I understand that you want to keep "direct incompressible writeback"
around. ZRAM is especially popular on devices which do suffer from flash
wearout, so I can see "incompressible writeback" path becoming a dead
code, long term.
== &< ==
Below concerns from Minchan:
== &< ==
My concern is if we enable CONFIG_ZRAM_WRITEBACK in this implementation,
both hugepage/idlepage writeck will turn on. However someuser want to
enable only idlepage writeback so we need to introduce turn on/off knob
for hugepage or new CONFIG_ZRAM_IDLEPAGE_WRITEBACK for those usecase. I
don't want to make it complicated *if possible*.
Long term, I imagine we need to make VM aware of new swap hierarchy a
little bit different with as-is. For example, first high priority swap
can return -EIO or -ENOCOMP, swap try to fallback to next lower priority
swap device. With that, hugepage writeback will work tranparently.
So we could regard it as regression because incompressible pages doesn't
go to backing storage automatically. Instead, user should do it via "echo
huge" > /sys/block/zram/writeback" manually.
== &< ==
Link: http://lkml.kernel.org/r/20181127055429.251614-6-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Joey Pabalinas <joeypabalinas@gmail.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
To support idle page writeback with upcoming patches, this patch
introduces a new ZRAM_IDLE flag.
Userspace can mark zram slots as "idle" via
"echo all > /sys/block/zramX/idle"
which marks every allocated zram slot as ZRAM_IDLE.
User could see it by /sys/kernel/debug/zram/zram0/block_state.
300 75.033841 ...i
301 63.806904 s..i
302 63.806919 ..hi
Once there is IO for the slot, the mark will be disappeared.
300 75.033841 ...
301 63.806904 s..i
302 63.806919 ..hi
Therefore, 300th block is idle zpage. With this feature,
user can how many zram has idle pages which are waste of memory.
Link: http://lkml.kernel.org/r/20181127055429.251614-5-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reviewed-by: Joey Pabalinas <joeypabalinas@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
| |
Rename some variables and restructure some code for better readability in
writeback and zs_free_page.
Link: http://lkml.kernel.org/r/20181127055429.251614-4-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reviewed-by: Joey Pabalinas <joeypabalinas@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Patch series "zram idle page writeback", v3.
Inherently, swap device has many idle pages which are rare touched since
it was allocated. It is never problem if we use storage device as swap.
However, it's just waste for zram-swap.
This patchset supports zram idle page writeback feature.
* Admin can define what is idle page "no access since X time ago"
* Admin can define when zram should writeback them
* Admin can define when zram should stop writeback to prevent wearout
Details are in each patch's description.
This patch (of 7):
================================
WARNING: inconsistent lock state
4.19.0+ #390 Not tainted
--------------------------------
inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
zram_verify/2095 [HC0[0]:SC1[1]:HE1:SE0] takes:
00000000b1828693 (&(&zram->bitmap_lock)->rlock){+.?.}, at: put_entry_bdev+0x1e/0x50
{SOFTIRQ-ON-W} state was registered at:
_raw_spin_lock+0x2c/0x40
zram_make_request+0x755/0xdc9
generic_make_request+0x373/0x6a0
submit_bio+0x6c/0x140
__swap_writepage+0x3a8/0x480
shrink_page_list+0x1102/0x1a60
shrink_inactive_list+0x21b/0x3f0
shrink_node_memcg.constprop.99+0x4f8/0x7e0
shrink_node+0x7d/0x2f0
do_try_to_free_pages+0xe0/0x300
try_to_free_pages+0x116/0x2b0
__alloc_pages_slowpath+0x3f4/0xf80
__alloc_pages_nodemask+0x2a2/0x2f0
__handle_mm_fault+0x42e/0xb50
handle_mm_fault+0x55/0xb0
__do_page_fault+0x235/0x4b0
page_fault+0x1e/0x30
irq event stamp: 228412
hardirqs last enabled at (228412): [<ffffffff98245846>] __slab_free+0x3e6/0x600
hardirqs last disabled at (228411): [<ffffffff98245625>] __slab_free+0x1c5/0x600
softirqs last enabled at (228396): [<ffffffff98e0031e>] __do_softirq+0x31e/0x427
softirqs last disabled at (228403): [<ffffffff98072051>] irq_exit+0xd1/0xe0
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&(&zram->bitmap_lock)->rlock);
<Interrupt>
lock(&(&zram->bitmap_lock)->rlock);
*** DEADLOCK ***
no locks held by zram_verify/2095.
stack backtrace:
CPU: 5 PID: 2095 Comm: zram_verify Not tainted 4.19.0+ #390
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
Call Trace:
<IRQ>
dump_stack+0x67/0x9b
print_usage_bug+0x1bd/0x1d3
mark_lock+0x4aa/0x540
__lock_acquire+0x51d/0x1300
lock_acquire+0x90/0x180
_raw_spin_lock+0x2c/0x40
put_entry_bdev+0x1e/0x50
zram_free_page+0xf6/0x110
zram_slot_free_notify+0x42/0xa0
end_swap_bio_read+0x5b/0x170
blk_update_request+0x8f/0x340
scsi_end_request+0x2c/0x1e0
scsi_io_completion+0x98/0x650
blk_done_softirq+0x9e/0xd0
__do_softirq+0xcc/0x427
irq_exit+0xd1/0xe0
do_IRQ+0x93/0x120
common_interrupt+0xf/0xf
</IRQ>
With writeback feature, zram_slot_free_notify could be called in softirq
context by end_swap_bio_read. However, bitmap_lock is not aware of that
so lockdep yell out:
get_entry_bdev
spin_lock(bitmap->lock);
irq
softirq
end_swap_bio_read
zram_slot_free_notify
zram_slot_lock <-- deadlock prone
zram_free_page
put_entry_bdev
spin_lock(bitmap->lock); <-- deadlock prone
With akpm's suggestion (i.e. bitmap operation is already atomic), we
could remove bitmap lock. It might fail to find a empty slot if serious
contention happens. However, it's not severe problem because huge page
writeback has already possiblity to fail if there is severe memory
pressure. Worst case is just keeping the incompressible in memory, not
storage.
The other problem is zram_slot_lock in zram_slot_slot_free_notify. To
make it safe is this patch introduces zram_slot_trylock where
zram_slot_free_notify uses it. Although it's rare to be contented, this
patch adds new debug stat "miss_free" to keep monitoring how often it
happens.
Link: http://lkml.kernel.org/r/20181127055429.251614-2-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reviewed-by: Joey Pabalinas <joeypabalinas@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
zRam as swap is useful for small memory device. However, swap means
those pages on zram are mostly cold pages due to VM's LRU algorithm.
Especially, once init data for application are touched for launching,
they tend to be not accessed any more and finally swapped out. zRAM can
store such cold pages as compressed form but it's pointless to keep in
memory. Better idea is app developers free them directly rather than
remaining them on heap.
This patch tell us last access time of each block of zram via "cat
/sys/kernel/debug/zram/zram0/block_state".
The output is as follows,
300 75.033841 .wh
301 63.806904 s..
302 63.806919 ..h
First column is zram's block index and 3rh one represents symbol (s:
same page w: written page to backing store h: huge page) of the block
state. Second column represents usec time unit of the block was last
accessed. So above example means the 300th block is accessed at
75.033851 second and it was huge so it was written to the backing store.
Admin can leverage this information to catch cold|incompressible pages
of process with *pagemap* once part of heaps are swapped out.
I used the feature a few years ago to find memory hoggers in userspace
to notify them what memory they have wasted without touch for a long
time. With it, they could reduce unnecessary memory space. However, at
that time, I hacked up zram for the feature but now I need the feature
again so I decided it would be better to upstream rather than keeping it
alone. I hope I submit the userspace tool to use the feature soon.
[akpm@linux-foundation.org: fix i386 printk warning]
[minchan@kernel.org: use ktime_get_boottime() instead of sched_clock()]
Link: http://lkml.kernel.org/r/20180420063525.GA253739@rodete-desktop-imager.corp.google.com
[akpm@linux-foundation.org: documentation tweak]
[akpm@linux-foundation.org: fix i386 printk warning]
[minchan@kernel.org: fix compile warning]
Link: http://lkml.kernel.org/r/20180508104849.GA8209@rodete-desktop-imager.corp.google.com
[rdunlap@infradead.org: fix printk formats]
Link: http://lkml.kernel.org/r/3652ccb1-96ef-0b0b-05d1-f661d7733dcc@infradead.org
Link: http://lkml.kernel.org/r/20180416090946.63057-5-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
zRam as swap is useful for small memory device. However, swap means
those pages on zram are mostly cold pages due to VM's LRU algorithm.
Especially, once init data for application are touched for launching,
they tend to be not accessed any more and finally swapped out. zRAM can
store such cold pages as compressed form but it's pointless to keep in
memory. Better idea is app developers free them directly rather than
remaining them on heap.
This patch records last access time of each block of zram so that With
upcoming zram memory tracking, it could help userspace developers to
reduce memory footprint.
Link: http://lkml.kernel.org/r/20180416090946.63057-4-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Mark incompressible pages so that we could investigate who is the owner
of the incompressible pages once the page is swapped out via using
upcoming zram memory tracker feature.
With it, we could prevent such pages to be swapped out by using mlock.
Otherwise we might remove them.
This patch exposes new stat for huge pages via mm_stat.
Link: http://lkml.kernel.org/r/20180416090946.63057-3-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Patch series "zram memory tracking", v5.
zRam as swap is useful for small memory device. However, swap means
those pages on zram are mostly cold pages due to VM's LRU algorithm.
Especially, once init data for application are touched for launching,
they tend to be not accessed any more and finally swapped out. zRAM can
store such cold pages as compressed form but it's pointless to keep in
memory. As well, it's pointless to store incompressible pages to zram
so better idea is app developers manages them directly like free or
mlock rather than remaining them on heap.
This patch provides a debugfs /sys/kernel/debug/zram/zram0/block_state
to represent each block's state so admin can investigate what memory is
cold|incompressible|same page with using pagemap once the pages are
swapped out.
The output is as follows:
300 75.033841 .wh
301 63.806904 s..
302 63.806919 ..h
First column is zram's block index and 3rh one represents symbol (s:
same page w: written page to backing store h: huge page) of the block
state. Second column represents usec time unit of the block was last
accessed. So above example means the 300th block is accessed at
75.033851 second and it was huge so it was written to the backing store.
This patch (of 4):
ZRAM_ACCESS is used for locking a slot of zram so correct the name. It
is also not a common flag to indicate status of the block so move the
declare position on top of the flag. Lastly, let's move the function to
the top of source code to be able to use it easily without forward
declaration.
Link: http://lkml.kernel.org/r/20180416090946.63057-2-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Merge updates from Andrew Morton:
- a few misc things
- ocfs2 updates
- the v9fs maintainers have been missing for a long time. I've taken
over v9fs patch slinging.
- most of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (116 commits)
mm,oom_reaper: check for MMF_OOM_SKIP before complaining
mm/ksm: fix interaction with THP
mm/memblock.c: cast constant ULLONG_MAX to phys_addr_t
headers: untangle kmemleak.h from mm.h
include/linux/mmdebug.h: make VM_WARN* non-rvals
mm/page_isolation.c: make start_isolate_page_range() fail if already isolated
mm: change return type to vm_fault_t
mm, oom: remove 3% bonus for CAP_SYS_ADMIN processes
mm, page_alloc: wakeup kcompactd even if kswapd cannot free more memory
kernel/fork.c: detect early free of a live mm
mm: make counting of list_lru_one::nr_items lockless
mm/swap_state.c: make bool enable_vma_readahead and swap_vma_readahead() static
block_invalidatepage(): only release page if the full page was invalidated
mm: kernel-doc: add missing parameter descriptions
mm/swap.c: remove @cold parameter description for release_pages()
mm/nommu: remove description of alloc_vm_area
zram: drop max_zpage_size and use zs_huge_class_size()
zsmalloc: introduce zs_huge_class_size()
mm: fix races between swapoff and flush dcache
fs/direct-io.c: minor cleanups in do_blockdev_direct_IO
...
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Remove ZRAM's enforced "huge object" value and use zsmalloc huge-class
watermark instead, which makes more sense.
TEST
- I used a 1G zram device, LZO compression back-end, original
data set size was 444MB. Looking at zsmalloc classes stats the
test ended up to be pretty fair.
BASE ZRAM/ZSMALLOC
=====================
zram mm_stat
498978816 191482495 199831552 0 199831552 15634 0
zsmalloc classes
class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable
...
151 2448 0 0 1240 1240 744 3 0
168 2720 0 0 4200 4200 2800 2 0
190 3072 0 0 10100 10100 7575 3 0
202 3264 0 0 380 380 304 4 0
254 4096 0 0 10620 10620 10620 1 0
Total 7 46 106982 106187 48787 0
PATCHED ZRAM/ZSMALLOC
=====================
zram mm_stat
498978816 182579184 194248704 0 194248704 15628 0
zsmalloc classes
class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable
...
151 2448 0 0 1240 1240 744 3 0
168 2720 0 0 4200 4200 2800 2 0
190 3072 0 0 10100 10100 7575 3 0
202 3264 0 0 7180 7180 5744 4 0
254 4096 0 0 3820 3820 3820 1 0
Total 8 45 106959 106193 47424 0
As we can see, we reduced the number of objects stored in class-4096,
because a huge number of objects which we previously forcibly stored in
class-4096 now stored in non-huge class-3264. This results in lower
memory consumption:
- zsmalloc now uses 47424 physical pages, which is less than 48787 pages
zsmalloc used before.
- objects that we store in class-3264 share zspages. That's why overall
the number of pages that both class-4096 and class-3264 consumed went
down from 10924 to 9564.
[sergey.senozhatsky.work@gmail.com: add pool param to zs_huge_class_size()]
Link: http://lkml.kernel.org/r/20180314081833.1096-3-sergey.senozhatsky@gmail.com
Link: http://lkml.kernel.org/r/20180306070639.7389-3-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|/
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
It happens often while I'm preparing a patch for a block driver that
I'm wondering: is a definition of SECTOR_SIZE and/or SECTOR_SHIFT
available for this driver? Do I have to introduce definitions of these
constants before I can use these constants? To avoid this confusion,
move the existing definitions of SECTOR_SIZE and SECTOR_SHIFT into the
<linux/blkdev.h> header file such that these become available for all
block drivers. Make the SECTOR_SIZE definition in the uapi msdos_fs.h
header file conditional to avoid that including that header file after
<linux/blkdev.h> causes the compiler to complain about a SECTOR_SIZE
redefinition.
Note: the SECTOR_SIZE / SECTOR_SHIFT / SECTOR_BITS definitions have
not been removed from uapi header files nor from NAND drivers in
which these constants are used for another purpose than converting
block layer offsets and sizes into a number of sectors.
Cc: David S. Miller <davem@davemloft.net>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch enables write IO to transfer data to backing device. For
that, it implements write_to_bdev function which creates new bio and
chaining with parent bio to make the parent bio asynchrnous.
For rw_page which don't have parent bio, it submit owned bio and handle
IO completion by zram_page_end_io.
Also, this patch defines new flag ZRAM_WB to mark written page for later
read IO.
[xieyisheng1@huawei.com: fix typo in comment]
Link: http://lkml.kernel.org/r/1502707447-6944-2-git-send-email-xieyisheng1@huawei.com
Link: http://lkml.kernel.org/r/1498459987-24562-8-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
Cc: Juneho Choi <juno.choi@lge.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
With backing device, zram needs management of free space of backing
device.
This patch adds bitmap logic to manage free space which is very naive.
However, it would be simple enough as considering uncompressible pages's
frequenty in zram.
Link: http://lkml.kernel.org/r/1498459987-24562-6-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Juneho Choi <juno.choi@lge.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
For writeback feature, user should set up backing device before the zram
working.
This patch enables the interface via /sys/block/zramX/backing_dev.
Currently, it supports block device only but it could be enhanced for
file as well.
Link: http://lkml.kernel.org/r/1498459987-24562-5-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Juneho Choi <juno.choi@lge.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
| |
It's redundant now. Instead, remove it and use zram structure directly.
Link: http://lkml.kernel.org/r/1492052365-16169-5-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The idea is that without doing more calculations we extend zero pages to
same element pages for zram. zero page is special case of same element
page with zero element.
1. the test is done under android 7.0
2. startup too many applications circularly
3. sample the zero pages, same pages (none-zero element)
and total pages in function page_zero_filled
the result is listed as below:
ZERO SAME TOTAL
36214 17842 598196
ZERO/TOTAL SAME/TOTAL (ZERO+SAME)/TOTAL ZERO/SAME
AVERAGE 0.060631909 0.024990816 0.085622726 2.663825038
STDEV 0.00674612 0.005887625 0.009707034 2.115881328
MAX 0.069698422 0.030046087 0.094975336 7.56043956
MIN 0.03959586 0.007332205 0.056055193 1.928985507
from the above data, the benefit is about 2.5% and up to 3% of total
swapout pages.
The defect of the patch is that when we recovery a page from non-zero
element the operations are low efficient for partial read.
This patch extends zero_page to same_page so if there is any user to
have monitored zero_pages, he will be surprised if the number is
increased but it's not harmful, I believe.
[minchan@kernel.org: do not free same element pages in zram_meta_free]
Link: http://lkml.kernel.org/r/20170207065741.GA2567@bbox
Link: http://lkml.kernel.org/r/1483692145-75357-1-git-send-email-zhouxianrong@huawei.com
Link: http://lkml.kernel.org/r/1486307804-27903-1-git-send-email-minchan@kernel.org
Signed-off-by: zhouxianrong <zhouxianrong@huawei.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
zram_reset_device() waits for ongoing writepage pages to be completed by
zram->refcount logic. However, it's pointless because before the reset,
we prevent further opening of zram by zram->claim and flush all of
pending IO by fsync_bdev so there should be no pending IO at the
zram_reset_device().
So let's remove that code which is even broken due to the lack of
wake_up elsewhere.
Link: http://lkml.kernel.org/r/1485145031-11661-1-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There is no way to get a string with all the crypto comp algorithms
supported by the crypto comp engine, so we need to maintain our own
backends list. At the same time we additionally need to use
crypto_has_comp() to make sure that the user has requested a compression
algorithm that is recognized by the crypto comp engine. Relying on
/proc/crypto is not an options here, because it does not show
not-yet-inserted compression modules.
Example:
modprobe zram
cat /proc/crypto | grep -i lz4
modprobe lz4
cat /proc/crypto | grep -i lz4
name : lz4
driver : lz4-generic
module : lz4
So the user can't tell exactly if the lz4 is really supported from
/proc/crypto output, unless someone or something has loaded it.
This patch also adds crypto_has_comp() to zcomp_available_show(). We
store all the compression algorithms names in zcomp's `backends' array,
regardless the CONFIG_CRYPTO_FOO configuration, but show only those that
are also supported by crypto engine. This helps user to know the exact
list of compression algorithms that can be used.
Example:
module lz4 is not loaded yet, but is supported by the crypto
engine. /proc/crypto has no information on this module, while
zram's `comp_algorithm' lists it:
cat /proc/crypto | grep -i lz4
cat /sys/block/zram0/comp_algorithm
[lzo] lz4 deflate lz4hc 842
We still use the `backends' array to determine if the requested
compression backend is known to crypto api. This array, however, may not
contain some entries, therefore as the last step we call crypto_has_comp()
function which attempts to insmod the requested compression algorithm to
determine if crypto api supports it. The advantage of this method is that
now we permit the usage of out-of-tree crypto compression modules
(implementing S/W or H/W compression).
[sergey.senozhatsky@gmail.com: zram-use-crypto-api-to-check-alg-availability-v3]
Link: http://lkml.kernel.org/r/20160604024902.11778-4-sergey.senozhatsky@gmail.com
Link: http://lkml.kernel.org/r/20160531122017.2878-5-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
debug_stat sysfs is read-only and represents various debugging data that
zram developers may need. This file is not meant to be used by anyone
else: its content is not documented and will change any time w/o any
notice. Therefore, the output of debug_stat file contains a version
string. To avoid any confusion, we will increase the version number
every time we modify the output.
At the moment this file exports only one value -- the number of
re-compressions, IOW, the number of times compression fast path has
failed. This stat is temporary any will be useful in case if any
per-cpu compression streams regressions will be reported.
Link: http://lkml.kernel.org/r/20160513230834.GB26763@bbox
Link: http://lkml.kernel.org/r/20160511134553.12655-1-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Remove the internal part of max_comp_streams interface, since we
switched to per-cpu streams. We will keep RW max_comp_streams attr
around, because:
a) we may (silently) switch back to idle compression streams list and
don't want to disturb user space
b) max_comp_streams attr must wait for the next 'lay off cycle'; we
give user space 2 years to adjust before we remove/downgrade the attr,
and there are already several attrs scheduled for removal in 4.11, so
it's too late for max_comp_streams.
This slightly change a user visible behaviour:
- First, reading from max_comp_stream file now will always return the
number of online CPUs.
- Second, writing to max_comp_stream will not take any effect.
Link: http://lkml.kernel.org/r/20160503165546.25201-1-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
`zs_compact_control' accounts the number of migrated objects but it has
a limited lifespan -- we lose it as soon as zs_compaction() returns back
to zram. It worked fine, because (a) zram had it's own counter of
migrated objects and (b) only zram could trigger compaction. However,
this does not work for automatic pool compaction (not issued by zram).
To account objects migrated during auto-compaction (issued by the
shrinker) we need to store this number in zs_pool.
Define a new `struct zs_pool_stats' structure to keep zs_pool's stats
there. It provides only `num_migrated', as of this writing, but it
surely can be extended.
A new zsmalloc zs_pool_stats() symbol exports zs_pool's stats back to
caller.
Use zs_pool_stats() in zram and remove `num_migrated' from zram_stats.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Suggested-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Original patch from Minchan Kim <minchan@kernel.org> ]
Commit ba6b17d68c8e ("zram: fix umount-reset_store-mount race
condition") introduced bdev->bd_mutex to protect a race between mount
and reset. At that time, we don't have dynamic zram-add/remove feature
so it was okay.
However, as we introduce dynamic device feature, bd_mutex became
trouble.
CPU 0
echo 1 > /sys/block/zram<id>/reset
-> kernfs->s_active(A)
-> zram:reset_store->bd_mutex(B)
CPU 1
echo <id> > /sys/class/zram/zram-remove
->zram:zram_remove: bd_mutex(B)
-> sysfs_remove_group
-> kernfs->s_active(A)
IOW, AB -> BA deadlock
The reason we are holding bd_mutex for zram_remove is to prevent
any incoming open /dev/zram[0-9]. Otherwise, we could remove zram
others already have opened. But it causes above deadlock problem.
To fix the problem, this patch overrides block_device.open and
it returns -EBUSY if zram asserts he claims zram to reset so any
incoming open will be failed so we don't need to hold bd_mutex
for zram_remove ayn more.
This patch is to prepare for zram-add/remove feature.
[sergey.senozhatsky@gmail.com: simplify reset_store()]
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
| |
Limiting the number of zram devices to 32 (default max_num_devices value)
is confusing, let's drop it. A user with 2TB or 4TB of RAM, for example,
can request as many devices as he can handle.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Now that zsmalloc supports compaction, zram can use it. For the first
step, this patch exports compact knob via sysfs so user can do compaction
via "echo 1 > /sys/block/zram0/compact".
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Juneho Choi <juno.choi@lge.com>
Cc: Gunho Lee <gunho.lee@lge.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjennings@variantweb.net>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|