| Commit message (Collapse) | Author | Files | Lines |
|
DTS expectations
Buck and LDO binding name changes.
The binding names for the regulators have been changed to match the current
expectation from existing device tree source files.
This fix rectifies the disparity between what currently exists in some
.dts[i] board files and what is listed in this binding document. This
change re-aligns those differences and also brings the binding document
in-line with the expectations of the product datasheet from Dialog
Semiconductor.
Bucks and LDOs now follow the expected notation:
{ buck1, buck2, buck3, buck4 }
{ ldo1, ldo2, ldo3, ldo4, ldo5, ldo6, ldo7, ldo8, ldo9, ldo10 }
Signed-off-by: Steve Twiss <stwiss.opensource@diasemi.com>
Signed-off-by: Rob Herring <robh@kernel.org>
|
|
This reverts commit 3c9fe8cdff1b889a059a30d22f130372f2b3885f.
As Miklos points out in commit c1b2cc1a765a, the "lookup_hash()" helper
is now unused, and in fact, with the hash salting changes, since the
hash of a dentry name now depends on the directory dentry it is in, the
helper function isn't even really likely to be useful.
So rather than keep it around in case somebody else might end up finding
a use for it, let's just remove the helper and not trick people into
thinking it might be a useful thing.
For example, I had obviously completely missed how the helper didn't
follow the normal dentry hashing patterns, and how the hash salting
patch broke overlayfs. Things would quietly build and look sane, but
not work.
Suggested-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
For PMD aligned (8M) hugepages, we currently allocate
all four page table levels which is wasteful. We now
allocate till PMD level only which saves memory usage
from page tables.
Also, when freeing page table for 8M hugepage backed region,
make sure we don't try to access non-existent PTE level.
Orabug: 22630259
Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Change "Warning" to "warning" to make it look more like a GCC warning.
Hopefully that will be enough to help the 0-day bot or other automated
tools catch this warning earlier before it ends up in Linus's tree.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/b1669f391a5db91040427fd9f8e1e79db18f9709.1469751119.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
This fixes the following warning:
Warning: objtool: x86 instruction decoder differs from kernel
Unfortunately we have three identical copies of the x86 instruction
decoder in the kernel tree that have to be manually kept in sync.
It's on my TODO list to at least library-ize the ones in the tools
subdir so we'd only have two of them instead of three. In the meantime,
here's another manual sync.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: c61f4d5ebaf0 ("perf tools: Add AVX-512 support to the instruction decoder used by Intel PT")
Link: http://lkml.kernel.org/r/d7f74b4d91fed25b0be33cd5c86f5131fa1a7529.1469751119.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
This fixes some false positive objtool warnings seen with gcc 6.1.1:
kernel/trace/ring_buffer.o: warning: objtool: ring_buffer_read_page()+0x36c: sibling call from callable instruction with changed frame pointer
arch/x86/kernel/reboot.o: warning: objtool: native_machine_emergency_restart()+0x139: sibling call from callable instruction with changed frame pointer
lib/xz/xz_dec_stream.o: warning: objtool: xz_dec_run()+0xc2: sibling call from callable instruction with changed frame pointer
With GCC 6, a new code pattern is sometimes used to access a switch
statement jump table in .rodata, which objtool doesn't yet recognize:
mov [rodata addr],%reg1
... some instructions ...
jmpq *(%reg1,%reg2,8)
Add support for detecting that pattern. The detection code is rather
crude, but it's still effective at weeding out false positives and
catching real warnings. It can be refined later once objtool starts
reading DWARF CFI.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/b8c9503b4ad8c8a827cc5400db4c1b40a3ea07bc.1469751119.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Can be used by fuse, btrfs and f2fs to replace opencoded variants.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
FUSE_HAS_IOCTL_DIR should be assigned to ->flags, it may be a typo.
Signed-off-by: Wei Fang <fangwei1@huawei.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 69fe05c90ed5 ("fuse: add missing INIT flags")
Cc: <stable@vger.kernel.org>
|
|
fuse_flush() calls write_inode_now() that triggers writeback, but actual
writeback will happen later, on fuse_sync_writes(). If an error happens,
fuse_writepage_end() will set error bit in mapping->flags. So, we have to
check mapping->flags after fuse_sync_writes().
Signed-off-by: Maxim Patlasov <mpatlasov@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 4d99ff8f12eb ("fuse: Turn writeback cache on")
Cc: <stable@vger.kernel.org> # v3.15+
|
|
Due to implementation of fuse writeback filemap_write_and_wait_range() does
not catch errors. We have to do this directly after fuse_sync_writes()
Signed-off-by: Alexey Kuznetsov <kuznet@virtuozzo.com>
Signed-off-by: Maxim Patlasov <mpatlasov@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 4d99ff8f12eb ("fuse: Turn writeback cache on")
Cc: <stable@vger.kernel.org> # v3.15+
|
|
In kernel bug 150021, a kernel panic was reported when restoring a
hibernate image. Only a picture of the oops was reported, so I can't
paste the whole thing here. But here are the most interesting parts:
kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
BUG: unable to handle kernel paging request at ffff8804615cfd78
...
RIP: ffff8804615cfd78
RSP: ffff8804615f0000
RBP: ffff8804615cfdc0
...
Call Trace:
do_signal+0x23
exit_to_usermode_loop+0x64
...
The RIP is on the same page as RBP, so it apparently started executing
on the stack.
The bug was bisected to commit ef0f3ed5a4ac (x86/asm/power: Create
stack frames in hibernate_asm_64.S), which in retrospect seems quite
dangerous, since that code saves and restores the stack pointer from a
global variable ('saved_context').
There are a lot of moving parts in the hibernate save and restore paths,
so I don't know exactly what caused the panic. Presumably, a FRAME_END
was executed without the corresponding FRAME_BEGIN, or vice versa. That
would corrupt the return address on the stack and would be consistent
with the details of the above panic.
[ rjw: One major problem is that by the time the FRAME_BEGIN in
restore_registers() is executed, the stack pointer value may not
be valid any more. Namely, the stack area pointed to by it
previously may have been overwritten by some image memory contents
and that page frame may now be used for whatever different purpose
it had been allocated for before hibernation. In that case, the
FRAME_BEGIN will corrupt that memory. ]
Instead of doing the frame pointer save/restore around the bounds of the
affected functions, just do it around the call to swsusp_save().
That has the same effect of ensuring that if swsusp_save() sleeps, the
frame pointers will be correct. It's also a much more obviously safe
way to do it than the original patch. And objtool still doesn't report
any warnings.
Fixes: ef0f3ed5a4ac (x86/asm/power: Create stack frames in hibernate_asm_64.S)
Link: https://bugzilla.kernel.org/show_bug.cgi?id=150021
Cc: 4.6+ <stable@vger.kernel.org> # 4.6+
Reported-by: Andre Reinke <andre.reinke@mailbox.org>
Tested-by: Andre Reinke <andre.reinke@mailbox.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
The empty checking logic is duplicated in ovl_check_empty_and_clear() and
ovl_remove_and_whiteout(), except the condition for clearing whiteouts is
different:
ovl_check_empty_and_clear() checked for being upper
ovl_remove_and_whiteout() checked for merge OR lower
Move the intersection of those checks (upper AND merge) into
ovl_check_empty_and_clear() and simplify ovl_remove_and_whiteout().
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
To make delete notification work on fa/inotify.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
This does not work and does not make sense. So instead of fixing it
(probably not hard) just disallow.
Reported-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>
|
|
There's a superfluous newline in the warning message in ovl_d_real().
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Remove duplicated include.
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Right now we remove MAY_WRITE/MAY_APPEND bits from mask if realfile is on
lower/. This is done as files on lower will never be written and will be
copied up. But to copy up a file, mounter should have MAY_READ permission
otherwise copy up will fail. So set MAY_READ in mask when MAY_WRITE is
reset.
Dan Walsh noticed this when he did access(lowerfile, W_OK) and it returned
True (context mounts) but when he tried to actually write to file, it
failed as mounter did not have permission on lower file.
[SzM] don't set MAY_READ if only MAY_APPEND is set without MAY_WRITE; this
won't trigger a copy-up.
Reported-by: Dan Walsh <dwalsh@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Right now if file is on lower/, we remove MAY_WRITE/MAY_APPEND bits from
mask as lower/ will never be written and file will be copied up. But this
is not true for special files. These files are not copied up and are opened
in place. So don't dilute the checks for these types of files.
Reported-by: Dan Walsh <dwalsh@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Setting POSIX ACL needs special handling:
1) Some permission checks are done by ->setxattr() which now uses mounter's
creds ("ovl: do operations on underlying file system in mounter's
context"). These permission checks need to be done with current cred as
well.
2) Setting ACL can fail for various reasons. We do not need to copy up in
these cases.
In the mean time switch to using generic_setxattr.
[Arnd Bergmann] Fix link error without POSIX ACL. posix_acl_from_xattr()
doesn't have a 'static inline' implementation when CONFIG_FS_POSIX_ACL is
disabled, and I could not come up with an obvious way to do it.
This instead avoids the link error by defining two sets of ACL operations
and letting the compiler drop one of the two at compile time depending
on CONFIG_FS_POSIX_ACL. This avoids all references to the ACL code,
also leading to smaller code.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Inode attributes are copied up to overlay inode (uid, gid, mode, atime,
mtime, ctime) so generic code using these fields works correcty. If a hard
link is created in overlayfs separate inodes are allocated for each link.
If chmod/chown/etc. is performed on one of the links then the inode
belonging to the other ones won't be updated.
This patch attempts to fix this by sharing inodes for hard links.
Use inode hash (with real inode pointer as a key) to make sure overlay
inodes are shared for hard links on upper. Hard links on lower are still
split (which is not user observable until the copy-up happens, see
Documentation/filesystems/overlayfs.txt under "Non-standard behavior").
The inode is only inserted in the hash if it is non-directoy and upper.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
To get from overlay inode to real inode we currently use 'struct
ovl_entry', which has lifetime connected to overlay dentry. This is okay,
since each overlay dentry had a new overlay inode allocated.
Following patch will break that assumption, so need to leave out ovl_entry.
This patch stores the real inode directly in i_private, with the lowest bit
used to indicate whether the inode is upper or lower.
Lifetime rules remain, using ovl_inode_real() must only be done while
caller holds ref on overlay dentry (and hence on real dentry), or within
RCU protected regions.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
The error is due to RCU and is temporary.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Fix atime update logic in overlayfs.
This patch adds an i_op->update_time() handler to overlayfs inodes. This
forwards atime updates to the upper layer only. No atime updates are done
on lower layers.
Remove implicit atime updates to underlying files and directories with
O_NOATIME. Remove explicit atime update in ovl_readlink().
Clear atime related mnt flags from cloned upper mount. This means atime
updates are controlled purely by overlayfs mount options.
Reported-by: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
When creating directory in workdir, the group/sgid inheritance from the
parent dir was omitted completely. Fix this by calling inode_init_owner()
on overlay inode and using the resulting uid/gid/mode to create the file.
Unfortunately the sgid bit can be stripped off due to umask, so need to
reset the mode in this case in workdir before moving the directory in
place.
Reported-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
The fact that we always do permission checking on the overlay inode and
clear MAY_WRITE for checking access to the lower inode allows cruft to be
removed from ovl_permission().
1) "default_permissions" option effectively did generic_permission() on the
overlay inode with i_mode, i_uid and i_gid updated from underlying
filesystem. This is what we do by default now. It did the update using
vfs_getattr() but that's only needed if the underlying filesystem can
change (which is not allowed). We may later introduce a "paranoia_mode"
that verifies that mode/uid/gid are not changed.
2) splitting out the IS_RDONLY() check from inode_permission() also becomes
unnecessary once we remove the MAY_WRITE from the lower inode check.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Now we have two levels of checks in ovl_permission(). overlay inode
is checked with the creds of task while underlying inode is checked
with the creds of mounter.
Looks like mounter does not have to have WRITE access to files on lower/.
So remove the MAY_WRITE from access mask for checks on underlying
lower inode.
This means task should still have the MAY_WRITE permission on lower
inode and mounter is not required to have MAY_WRITE.
It also solves the problem of read only NFS mounts being used as lower.
If __inode_permission(lower_inode, MAY_WRITE) is called on read only
NFS, it fails. By resetting MAY_WRITE, check succeeds and case of
read only NFS shold work with overlay without having to specify any
special mount options (default permission).
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Given we are now doing checks both on overlay inode as well underlying
inode, we should be able to do checks and operations on underlying file
system using mounter's context.
So modify all operations to do checks/operations on underlying dentry/inode
in the context of mounter.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Right now ovl_permission() calls __inode_permission(realinode), to do
permission checks on real inode and no checks are done on overlay inode.
Modify it to do checks both on overlay inode as well as underlying inode.
Checks on overlay inode will be done with the creds of calling task while
checks on underlying inode will be done with the creds of mounter.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Now we are planning to do DAC permission checks on overlay inode
itself. And to make it work, we will need to make sure we can get acls from
underlying inode. So define ->get_acl() for overlay inodes and this in turn
calls into underlying filesystem to get acls, if any.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
ovl_create_upper() and ovl_create_over_whiteout() seem to be sharing some
common code which can be moved into a separate function. No functionality
change.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Previously this was only done for directory inodes. Doing so for all
inodes makes for a nice cleanup in ovl_permission at zero cost.
Inodes are not shared for hard links on the overlay, so this works fine.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
No point in keeping overlay inodes around since they will never be reused.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
The hash salting changes meant that we can no longer reuse the hash in the
overlay dentry to look up the underlying dentry.
Instead of lookup_hash(), use lookup_one_len_unlocked() and swith to
mounter's creds (like we do for all other operations later in the series).
Now the lookup_hash() export introduced in 4.6 by 3c9fe8cdff1b ("vfs: add
lookup_hash() helper") is unused and can possibly be removed; its
usefulness negated by the hash salting and the idea that mounter's creds
should be used on operations on underlying filesystems.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 8387ff2577eb ("vfs: make the string hashes salt the hash")
|
|
The pio_dev[] array has MAX_NR_PIO_DEVICES elements so the > should be
>=.
Fixes: 5f97f7f9400d ('[PATCH] avr32 architecture')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
|
|
This patch swaps the mix of tabs and space for alignment of comment
after code to use spaces only.
Also document why recvmmsg was defined twice in the syscall_table.S
table, but only once in unistd.h. In short, wired in the table by
generic arch patch, but forgotten in unistd.h (review slip).
|
|
This patch wires up the new preadv2 and pwritev2 syscall on AVR32.
On AVR32, all parameters beyond the 5th are passed on the stack. System
calls don't use the stack -- they borrow a callee-saved register
instead. This means that syscalls that take 6 parameters must be called
through a stub that pushes the last parameter on the stack.
Signed-off-by: Hans-Christian Noren Egtvedt <egtvedt@samfundet.no>
|
|
do_sparc64_fault() calculates both the base and huge page RSS sizes and
uses this information in calls to tsb_grow(). The calculation for base
page TSB size is not correct if the task uses hugetlb pages. hugetlb
pages are not accounted for in RSS, therefore the call to get_mm_rss(mm)
does not include hugetlb pages. However, the number of pages based on
huge_pte_count (which does include hugetlb pages) is subtracted from
this value. This will result in an artificially small and often negative
RSS calculation. The base TSB size is then often set to max_tsb_size
as the passed RSS is unsigned, so a negative value looks really big.
THP pages are also accounted for in huge_pte_count, and THP pages are
accounted for in RSS so the calculation in do_sparc64_fault() is correct
if a task only uses THP pages.
A single huge_pte_count is not sufficient for TSB sizing if both hugetlb
and THP pages can be used. Instead of a single counter, use two: one
for hugetlb and one for THP.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Async compaction detects contention either due to failing trylock on
zone->lock or lru_lock, or by need_resched(). Since 1f9efdef4f3f ("mm,
compaction: khugepaged should not give up due to need_resched()") the
code got quite complicated to distinguish these two up to the
__alloc_pages_slowpath() level, so different decisions could be taken
for khugepaged allocations.
After the recent changes, khugepaged allocations don't check for
contended compaction anymore, so we again don't need to distinguish lock
and sched contention, and simplify the current convoluted code a lot.
However, I believe it's also possible to simplify even more and
completely remove the check for contended compaction after the initial
async compaction for costly orders, which was originally aimed at THP
page fault allocations. There are several reasons why this can be done
now:
- with the new defaults, THP page faults no longer do reclaim/compaction at
all, unless the system admin has overridden the default, or application has
indicated via madvise that it can benefit from THP's. In both cases, it
means that the potential extra latency is expected and worth the benefits.
- even if reclaim/compaction proceeds after this patch where it previously
wouldn't, the second compaction attempt is still async and will detect the
contention and back off, if the contention persists
- there are still heuristics like deferred compaction and pageblock skip bits
in place that prevent excessive THP page fault latencies
Link: http://lkml.kernel.org/r/20160721073614.24395-9-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
In the context of direct compaction, for some types of allocations we
would like the compaction to either succeed or definitely fail while
trying as hard as possible. Current async/sync_light migration mode is
insufficient, as there are heuristics such as caching scanner positions,
marking pageblocks as unsuitable or deferring compaction for a zone. At
least the final compaction attempt should be able to override these
heuristics.
To communicate how hard compaction should try, we replace migration mode
with a new enum compact_priority and change the relevant function
signatures. In compact_zone_order() where struct compact_control is
constructed, the priority is mapped to suitable control flags. This
patch itself has no functional change, as the current priority levels
are mapped back to the same migration modes as before. Expanding them
will be done next.
Note that !CONFIG_COMPACTION variant of try_to_compact_pages() is
removed, as the only caller exists under CONFIG_COMPACTION.
Link: http://lkml.kernel.org/r/20160721073614.24395-8-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
After the previous patch, we can distinguish costly allocations that
should be really lightweight, such as THP page faults, with
__GFP_NORETRY. This means we don't need to recognize khugepaged
allocations via PF_KTHREAD anymore. We can also change THP page faults
in areas where madvise(MADV_HUGEPAGE) was used to try as hard as
khugepaged, as the process has indicated that it benefits from THP's and
is willing to pay some initial latency costs.
We can also make the flags handling less cryptic by distinguishing
GFP_TRANSHUGE_LIGHT (no reclaim at all, default mode in page fault) from
GFP_TRANSHUGE (only direct reclaim, khugepaged default). Adding
__GFP_NORETRY or __GFP_KSWAPD_RECLAIM is done where needed.
The patch effectively changes the current GFP_TRANSHUGE users as
follows:
* get_huge_zero_page() - the zero page lifetime should be relatively
long and it's shared by multiple users, so it's worth spending some
effort on it. We use GFP_TRANSHUGE, and __GFP_NORETRY is not added.
This also restores direct reclaim to this allocation, which was
unintentionally removed by commit e4a49efe4e7e ("mm: thp: set THP defrag
by default to madvise and add a stall-free defrag option")
* alloc_hugepage_khugepaged_gfpmask() - this is khugepaged, so latency
is not an issue. So if khugepaged "defrag" is enabled (the default), do
reclaim via GFP_TRANSHUGE without __GFP_NORETRY. We can remove the
PF_KTHREAD check from page alloc.
As a side-effect, khugepaged will now no longer check if the initial
compaction was deferred or contended. This is OK, as khugepaged sleep
times between collapsion attempts are long enough to prevent noticeable
disruption, so we should allow it to spend some effort.
* migrate_misplaced_transhuge_page() - already was masking out
__GFP_RECLAIM, so just convert to GFP_TRANSHUGE_LIGHT which is
equivalent.
* alloc_hugepage_direct_gfpmask() - vma's with VM_HUGEPAGE (via madvise)
are now allocating without __GFP_NORETRY. Other vma's keep using
__GFP_NORETRY if direct reclaim/compaction is at all allowed (by default
it's allowed only for madvised vma's). The rest is conversion to
GFP_TRANSHUGE(_LIGHT).
[mhocko@suse.com: suggested GFP_TRANSHUGE_LIGHT]
Link: http://lkml.kernel.org/r/20160721073614.24395-7-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Since THP allocations during page faults can be costly, extra decisions
are employed for them to avoid excessive reclaim and compaction, if the
initial compaction doesn't look promising. The detection has never been
perfect as there is no gfp flag specific to THP allocations. At this
moment it checks the whole combination of flags that makes up
GFP_TRANSHUGE, and hopes that no other users of such combination exist,
or would mind being treated the same way. Extra care is also taken to
separate allocations from khugepaged, where latency doesn't matter that
much.
It is however possible to distinguish these allocations in a simpler and
more reliable way. The key observation is that after the initial
compaction followed by the first iteration of "standard"
reclaim/compaction, both __GFP_NORETRY allocations and costly
allocations without __GFP_REPEAT are declared as failures:
/* Do not loop if specifically requested */
if (gfp_mask & __GFP_NORETRY)
goto nopage;
/*
* Do not retry costly high order allocations unless they are
* __GFP_REPEAT
*/
if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
goto nopage;
This means we can further distinguish allocations that are costly order
*and* additionally include the __GFP_NORETRY flag. As it happens,
GFP_TRANSHUGE allocations do already fall into this category. This will
also allow other costly allocations with similar high-order benefit vs
latency considerations to use this semantic. Furthermore, we can
distinguish THP allocations that should try a bit harder (such as from
khugepageed) by removing __GFP_NORETRY, as will be done in the next
patch.
Link: http://lkml.kernel.org/r/20160721073614.24395-6-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The retry loop in __alloc_pages_slowpath is supposed to keep trying
reclaim and compaction (and OOM), until either the allocation succeeds,
or returns with failure. Success here is more probable when reclaim
precedes compaction, as certain watermarks have to be met for compaction
to even try, and more free pages increase the probability of compaction
success. On the other hand, starting with light async compaction (if
the watermarks allow it), can be more efficient, especially for smaller
orders, if there's enough free memory which is just fragmented.
Thus, the current code starts with compaction before reclaim, and to
make sure that the last reclaim is always followed by a final
compaction, there's another direct compaction call at the end of the
loop. This makes the code hard to follow and adds some duplicated
handling of migration_mode decisions. It's also somewhat inefficient
that even if reclaim or compaction decides not to retry, the final
compaction is still attempted. Some gfp flags combination also shortcut
these retry decisions by "goto noretry;", making it even harder to
follow.
This patch attempts to restructure the code with only minimal functional
changes. The call to the first compaction and THP-specific checks are
now placed above the retry loop, and the "noretry" direct compaction is
removed.
The initial compaction is additionally restricted only to costly orders,
as we can expect smaller orders to be held back by watermarks, and only
larger orders to suffer primarily from fragmentation. This better
matches the checks in reclaim's shrink_zones().
There are two other smaller functional changes. One is that the upgrade
from async migration to light sync migration will always occur after the
initial compaction. This is how it has been until recent patch "mm,
oom: protect !costly allocations some more", which introduced upgrading
the mode based on COMPACT_COMPLETE result, but kept the final compaction
always upgraded, which made it even more special. It's better to return
to the simpler handling for now, as migration modes will be further
modified later in the series.
The second change is that once both reclaim and compaction declare it's
not worth to retry the reclaim/compact loop, there is no final
compaction attempt. As argued above, this is intentional. If that
final compaction were to succeed, it would be due to a wrong retry
decision, or simply a race with somebody else freeing memory for us.
The main outcome of this patch should be simpler code. Logically, the
initial compaction without reclaim is the exceptional case to the
reclaim/compaction scheme, but prior to the patch, it was the last loop
iteration that was exceptional. Now the code matches the logic better.
The change also enable the following patches.
Link: http://lkml.kernel.org/r/20160721073614.24395-5-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
After __alloc_pages_slowpath() sets up new alloc_flags and wakes up
kswapd, it first tries get_page_from_freelist() with the new
alloc_flags, as it may succeed e.g. due to using min watermark instead
of low watermark. It makes sense to to do this attempt before adjusting
zonelist based on alloc_flags/gfp_mask, as it's still relatively a fast
path if we just wake up kswapd and successfully allocate.
This patch therefore moves the initial attempt above the retry label and
reorganizes a bit the part below the retry label. We still have to
attempt get_page_from_freelist() on each retry, as some allocations
cannot do that as part of direct reclaim or compaction, and yet are not
allowed to fail (even though they do a WARN_ON_ONCE() and thus should
not exist). We can reuse the call meant for ALLOC_NO_WATERMARKS attempt
and just set alloc_flags to ALLOC_NO_WATERMARKS if the context allows
it. As a side-effect, the attempts from direct reclaim/compaction will
also no longer obey watermarks once this is set, but there's little harm
in that.
Kswapd wakeups are also done on each retry to be safe from potential
races resulting in kswapd going to sleep while a process (that may not
be able to reclaim by itself) is still looping.
Link: http://lkml.kernel.org/r/20160721073614.24395-4-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
In __alloc_pages_slowpath(), alloc_flags doesn't change after it's
initialized, so move the initialization above the retry: label. Also
make the comment above the initialization more descriptive.
The only exception in the alloc_flags being constant is
ALLOC_NO_WATERMARKS, which may change due to TIF_MEMDIE being set on the
allocating thread. We can fix this, and make the code simpler and a bit
more effective at the same time, by moving the part that determines
ALLOC_NO_WATERMARKS from gfp_to_alloc_flags() to gfp_pfmemalloc_allowed().
This means we don't have to mask out ALLOC_NO_WATERMARKS in numerous
places in __alloc_pages_slowpath() anymore. The only two tests for the
flag can instead call gfp_pfmemalloc_allowed().
Link: http://lkml.kernel.org/r/20160721073614.24395-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This (large, atomic) allocation attempt can fail. We expect and handle
that, so avoid the scary warning.
Link: http://lkml.kernel.org/r/20160720151905.GB19146@node.shutemov.name
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
For KASAN builds:
- switch SLUB allocator to using stackdepot instead of storing the
allocation/deallocation stacks in the objects;
- change the freelist hook so that parts of the freelist can be put
into the quarantine.
[aryabinin@virtuozzo.com: fixes]
Link: http://lkml.kernel.org/r/1468601423-28676-1-git-send-email-aryabinin@virtuozzo.com
Link: http://lkml.kernel.org/r/1468347165-41906-3-git-send-email-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <adech.fo@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Steven Rostedt (Red Hat) <rostedt@goodmis.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Kuthonuzo Luruo <kuthonuzo.luruo@hpe.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When looking up the nearest SLUB object for a given address, correctly
calculate its offset if SLAB_RED_ZONE is enabled for that cache.
Previously, when KASAN had detected an error on an object from a cache
with SLAB_RED_ZONE set, the actual start address of the object was
miscalculated, which led to random stacks having been reported.
When looking up the nearest SLUB object for a given address, correctly
calculate its offset if SLAB_RED_ZONE is enabled for that cache.
Fixes: 7ed2f9e663854db ("mm, kasan: SLAB support")
Link: http://lkml.kernel.org/r/1468347165-41906-2-git-send-email-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <adech.fo@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Steven Rostedt (Red Hat) <rostedt@goodmis.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Kuthonuzo Luruo <kuthonuzo.luruo@hpe.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
There's one case when vma_adjust() expands the vma, overlapping with
*two* next vma. See case 6 of mprotect, described in the comment to
vma_merge().
To handle this (and only this) situation we iterate twice over main part
of the function. See "goto again".
Vegard reported[1] that he sees out-of-bounds access complain from
KASAN, if anon_vma_clone() on the *second* iteration fails.
This happens because we free 'next' vma by the end of first iteration
and don't have a way to undo this if anon_vma_clone() fails on the
second iteration.
The solution is to do all required allocations upfront, before we touch
vmas.
The allocation on the second iteration is only required if first two
vmas don't have anon_vma, but third does. So we need, in total, one
anon_vma_clone() call.
It's easy to adjust 'exporter' to the third vma for such case.
[1] http://lkml.kernel.org/r/1469514843-23778-1-git-send-email-vegard.nossum@oracle.com
Link: http://lkml.kernel.org/r/1469625255-126641-1-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|