summaryrefslogtreecommitdiffstats
path: root/kernel (follow)
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'spdx-5.7-rc1' of ↵Linus Torvalds2020-04-032-3/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx Pull SPDX updates from Greg KH: "Here are three SPDX patches for 5.7-rc1. One fixes up the SPDX tag for a single driver, while the other two go through the tree and add SPDX tags for all of the .gitignore files as needed. Nothing too complex, but you will get a merge conflict with your current tree, that should be trivial to handle (one file modified by two things, one file deleted.) All three of these have been in linux-next for a while, with no reported issues other than the merge conflict" * tag 'spdx-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx: ASoC: MT6660: make spdxcheck.py happy .gitignore: add SPDX License Identifier .gitignore: remove too obvious comments
| * .gitignore: add SPDX License IdentifierMasahiro Yamada2020-03-252-0/+2
| | | | | | | | | | | | | | Add SPDX License Identifier to all .gitignore files. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * .gitignore: remove too obvious commentsMasahiro Yamada2020-03-251-3/+0
| | | | | | | | | | | | | | | | | | Some .gitignore files have comments like "Generated files", "Ignore generated files" at the header part, but they are too obvious. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | Merge branch 'for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds2020-04-031-8/+4
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | Pull workqueue updates from Tejun Heo: "Nothing too interesting. Just two trivial patches" * 'for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: Mark up unlocked access to wq->first_flusher workqueue: Make workqueue_init*() return void
| * | workqueue: Mark up unlocked access to wq->first_flusherChris Wilson2020-03-121-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ 7329.671518] BUG: KCSAN: data-race in flush_workqueue / flush_workqueue [ 7329.671549] [ 7329.671572] write to 0xffff8881f65fb250 of 8 bytes by task 37173 on cpu 2: [ 7329.671607] flush_workqueue+0x3bc/0x9b0 (kernel/workqueue.c:2844) [ 7329.672527] [ 7329.672540] read to 0xffff8881f65fb250 of 8 bytes by task 37175 on cpu 0: [ 7329.672571] flush_workqueue+0x28d/0x9b0 (kernel/workqueue.c:2835) Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Tejun Heo <tj@kernel.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
| * | workqueue: Make workqueue_init*() return voidYu Chen2020-03-041-6/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | The return values of workqueue_init() and workqueue_early_int() are always 0, and there is no usage of their return value. So just make them return void. Signed-off-by: Yu Chen <chen.yu@easystack.cn> Signed-off-by: Tejun Heo <tj@kernel.org>
* | | Merge branch 'for-5.7' of ↵Linus Torvalds2020-04-035-123/+314
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - Christian extended clone3 so that processes can be spawned into cgroups directly. This is not only neat in terms of semantics but also avoids grabbing the global cgroup_threadgroup_rwsem for migration. - Daniel added !root xattr support to cgroupfs. Userland already uses xattrs on cgroupfs for bookkeeping. This will allow delegated cgroups to support such usages. - Prateek tried to make cpuset hotplug handling synchronous but that led to possible deadlock scenarios. Reverted. - Other minor changes including release_agent_path handling cleanup. * 'for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: docs: cgroup-v1: Document the cpuset_v2_mode mount option Revert "cpuset: Make cpuset hotplug synchronous" cgroupfs: Support user xattrs kernfs: Add option to enable user xattrs kernfs: Add removed_size out param for simple_xattr_set kernfs: kvmalloc xattr value instead of kmalloc cgroup: Restructure release_agent_path handling selftests/cgroup: add tests for cloning into cgroups clone3: allow spawning processes into cgroups cgroup: add cgroup_may_write() helper cgroup: refactor fork helpers cgroup: add cgroup_get_from_file() helper cgroup: unify attach permission checking cpuset: Make cpuset hotplug synchronous cgroup.c: Use built-in RCU list checking kselftest/cgroup: add cgroup destruction test cgroup: Clean up css_set task traversal
| * | | docs: cgroup-v1: Document the cpuset_v2_mode mount optionWaiman Long2020-04-031-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The cpuset in cgroup v1 accepts a special "cpuset_v2_mode" mount option that make cpuset.cpus and cpuset.mems behave more like those in cgroup v2. Document it to make other people more aware of this feature that can be useful in some circumstances. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | Revert "cpuset: Make cpuset hotplug synchronous"Tejun Heo2020-04-032-19/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit a49e4629b5ed ("cpuset: Make cpuset hotplug synchronous") as it may deadlock with cpu hotplug path. Link: http://lkml.kernel.org/r/F0388D99-84D7-453B-9B6B-EEFF0E7BE4CC@lca.pw Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Qian Cai <cai@lca.pw> Cc: Prateek Sood <prsood@codeaurora.org>
| * | | cgroupfs: Support user xattrsDaniel Xu2020-03-161-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch turns on xattr support for cgroupfs. This is useful for letting non-root owners of delegated subtrees attach metadata to cgroups. One use case is for subtree owners to tell a userspace out of memory killer to bias away from killing specific subtrees. Tests: [/sys/fs/cgroup]# for i in $(seq 0 130); \ do setfattr workload.slice -n user.name$i -v wow; done setfattr: workload.slice: No space left on device setfattr: workload.slice: No space left on device setfattr: workload.slice: No space left on device [/sys/fs/cgroup]# for i in $(seq 0 130); \ do setfattr workload.slice --remove user.name$i; done setfattr: workload.slice: No such attribute setfattr: workload.slice: No such attribute setfattr: workload.slice: No such attribute [/sys/fs/cgroup]# for i in $(seq 0 130); \ do setfattr workload.slice -n user.name$i -v wow; done setfattr: workload.slice: No space left on device setfattr: workload.slice: No space left on device setfattr: workload.slice: No space left on device `seq 0 130` is inclusive, and 131 - 128 = 3, which is the number of errors we expect to see. [/data]# cat testxattr.c #include <sys/types.h> #include <sys/xattr.h> #include <stdio.h> #include <stdlib.h> int main() { char name[256]; char *buf = malloc(64 << 10); if (!buf) { perror("malloc"); return 1; } for (int i = 0; i < 4; ++i) { snprintf(name, 256, "user.bigone%d", i); if (setxattr("/sys/fs/cgroup/system.slice", name, buf, 64 << 10, 0)) { printf("setxattr failed on iteration=%d\n", i); return 1; } } return 0; } [/data]# ./a.out setxattr failed on iteration=2 [/data]# ./a.out setxattr failed on iteration=0 [/sys/fs/cgroup]# setfattr -x user.bigone0 system.slice/ [/sys/fs/cgroup]# setfattr -x user.bigone1 system.slice/ [/data]# ./a.out setxattr failed on iteration=2 Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Acked-by: Chris Down <chris@chrisdown.name> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | cgroup: Restructure release_agent_path handlingTejun Heo2020-03-121-17/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cgrp->root->release_agent_path is protected by both cgroup_mutex and release_agent_path_lock and readers can hold either one. The dual-locking scheme was introduced while breaking a locking dependency issue around cgroup_mutex but doesn't make sense anymore given that the only remaining reader which uses cgroup_mutex is cgroup1_releaes_agent(). This patch updates cgroup1_release_agent() to use release_agent_path_lock so that release_agent_path is always protected only by release_agent_path_lock. While at it, convert strlen() based empty string checks to direct tests on the first character as suggested by Linus. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org>
| * | | Merge branch 'for-5.6-fixes' into for-5.7Tejun Heo2020-03-122-4/+4
| |\ \ \
| * | | | clone3: allow spawning processes into cgroupsChristian Brauner2020-02-123-31/+188
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds support for creating a process in a different cgroup than its parent. Callers can limit and account processes and threads right from the moment they are spawned: - A service manager can directly spawn new services into dedicated cgroups. - A process can be directly created in a frozen cgroup and will be frozen as well. - The initial accounting jitter experienced by process supervisors and daemons is eliminated with this. - Threaded applications or even thread implementations can choose to create a specific cgroup layout where each thread is spawned directly into a dedicated cgroup. This feature is limited to the unified hierarchy. Callers need to pass a directory file descriptor for the target cgroup. The caller can choose to pass an O_PATH file descriptor. All usual migration restrictions apply, i.e. there can be no processes in inner nodes. In general, creating a process directly in a target cgroup adheres to all migration restrictions. One of the biggest advantages of this feature is that CLONE_INTO_GROUP does not need to grab the write side of the cgroup cgroup_threadgroup_rwsem. This global lock makes moving tasks/threads around super expensive. With clone3() this lock is avoided. Cc: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: cgroups@vger.kernel.org Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | | cgroup: add cgroup_may_write() helperChristian Brauner2020-02-121-7/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a cgroup_may_write() helper which we can use in the CLONE_INTO_CGROUP patch series to verify that we can write to the destination cgroup. Cc: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Li Zefan <lizefan@huawei.com> Cc: cgroups@vger.kernel.org Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | | cgroup: refactor fork helpersChristian Brauner2020-02-122-23/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This refactors the fork helpers so they can be easily modified in the next patches. The patch just moves the cgroup threadgroup rwsem grab and release into the helpers. They don't need to be directly exposed in fork.c. Cc: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Li Zefan <lizefan@huawei.com> Cc: cgroups@vger.kernel.org Acked-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | | cgroup: add cgroup_get_from_file() helperChristian Brauner2020-02-121-11/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a helper cgroup_get_from_file(). The helper will be used in subsequent patches to retrieve a cgroup while holding a reference to the struct file it was taken from. Cc: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Li Zefan <lizefan@huawei.com> Cc: cgroups@vger.kernel.org Acked-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | | cgroup: unify attach permission checkingChristian Brauner2020-02-121-14/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The core codepaths to check whether a process can be attached to a cgroup are the same for threads and thread-group leaders. Only a small piece of code verifying that source and destination cgroup are in the same domain differentiates the thread permission checking from thread-group leader permission checking. Since cgroup_migrate_vet_dst() only matters cgroup2 - it is a noop on cgroup1 - we can move it out of cgroup_attach_task(). All checks can now be consolidated into a new helper cgroup_attach_permissions() callable from both cgroup_procs_write() and cgroup_threads_write(). Cc: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: cgroups@vger.kernel.org Acked-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | | cpuset: Make cpuset hotplug synchronousPrateek Sood2020-02-122-14/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Convert cpuset_hotplug_workfn() into synchronous call for cpu hotplug path. For memory hotplug path it still gets queued as a work item. Since cpuset_hotplug_workfn() can be made synchronous for cpu hotplug path, it is not required to wait for cpuset hotplug while thawing processes. Signed-off-by: Prateek Sood <prsood@codeaurora.org> Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | | cgroup.c: Use built-in RCU list checkingMadhuparna Bhowmik2020-02-121-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | list_for_each_entry_rcu has built-in RCU and lock checking. Pass cond argument to list_for_each_entry_rcu() to silence false lockdep warning when CONFIG_PROVE_RCU_LIST is enabled by default. Even though the function css_next_child() already checks if cgroup_mutex or rcu_read_lock() is held using cgroup_assert_mutex_or_rcu_locked(), there is a need to pass cond to list_for_each_entry_rcu() to avoid false positive lockdep warning. Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com> Acked-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | | cgroup: Clean up css_set task traversalMichal Koutný2020-02-121-33/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | css_task_iter stores pointer to head of each iterable list, this dates back to commit 0f0a2b4fa621 ("cgroup: reorganize css_task_iter") when we did not store cur_cset. Let us utilize list heads directly in cur_cset and streamline css_task_iter_advance_css_set a bit. This is no intentional function change. Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* | | | | Merge tag 'kgdb-5.7-rc1' of ↵Linus Torvalds2020-04-031-8/+12
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux Pull kgdb updates from Daniel Thompson: "Pretty quiet this cycle. Just a couple of small fixes from myself both of which were reviewed by Doug Anderson to keep me honest (thanks)" * tag 'kgdb-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux: kdb: Censor attempts to set PROMPT without ENABLE_MEM_READ kdb: Eliminate strncpy() warnings by replacing with strscpy()
| * | | | | kdb: Censor attempts to set PROMPT without ENABLE_MEM_READDaniel Thompson2020-04-011-4/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently the PROMPT variable could be abused to provoke the printf() machinery to read outside the current stack frame. Normally this doesn't matter becaues md is already a much better tool for reading from memory. However the md command can be disabled by not setting KDB_ENABLE_MEM_READ. Let's also prevent PROMPT from being modified in these circumstances. Whilst adding a comment to help future code reviewers we also remove the #ifdef where PROMPT in consumed. There is no problem passing an unused (0) to snprintf when !CONFIG_SMP. argument Reported-by: Wang Xiayang <xywang.sjtu@sjtu.edu.cn> Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org> Reviewed-by: Douglas Anderson <dianders@chromium.org>
| * | | | | kdb: Eliminate strncpy() warnings by replacing with strscpy()Daniel Thompson2020-04-011-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently the code to manage the kdb history buffer uses strncpy() to copy strings to/and from the history and exhibits the classic "but nobody ever told me that strncpy() doesn't always terminate strings" bug. Modern gcc compilers recognise this bug and issue a warning. In reality these calls will only abridge the copied string if kdb_read() has *already* overflowed the command buffer. Thus the use of counted copies here is only used to reduce the secondary effects of a bug elsewhere in the code. Therefore transitioning these calls into strscpy() (without checking the return code) is appropriate. Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org> Reviewed-by: Douglas Anderson <dianders@chromium.org>
* | | | | | Merge branch 'akpm' (patches from Andrew)Linus Torvalds2020-04-023-6/+49
|\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge updates from Andrew Morton: "A large amount of MM, plenty more to come. Subsystems affected by this patch series: - tools - kthread - kbuild - scripts - ocfs2 - vfs - mm: slub, kmemleak, pagecache, gup, swap, memcg, pagemap, mremap, sparsemem, kasan, pagealloc, vmscan, compaction, mempolicy, hugetlbfs, hugetlb" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (155 commits) include/linux/huge_mm.h: check PageTail in hpage_nr_pages even when !THP mm/hugetlb: fix build failure with HUGETLB_PAGE but not HUGEBTLBFS selftests/vm: fix map_hugetlb length used for testing read and write mm/hugetlb: remove unnecessary memory fetch in PageHeadHuge() mm/hugetlb.c: clean code by removing unnecessary initialization hugetlb_cgroup: add hugetlb_cgroup reservation docs hugetlb_cgroup: add hugetlb_cgroup reservation tests hugetlb: support file_region coalescing again hugetlb_cgroup: support noreserve mappings hugetlb_cgroup: add accounting for shared mappings hugetlb: disable region_add file_region coalescing hugetlb_cgroup: add reservation accounting for private mappings mm/hugetlb_cgroup: fix hugetlb_cgroup migration hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations hugetlb_cgroup: add hugetlb_cgroup reservation counter hugetlbfs: Use i_mmap_rwsem to address page fault/truncate race hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization mm/memblock.c: remove redundant assignment to variable max_addr mm: mempolicy: require at least one nodeid for MPOL_PREFERRED mm: mempolicy: use VM_BUG_ON_VMA in queue_pages_test_walk() ...
| * | | | | | mm/compaction: Disable compact_unevictable_allowed on RTSebastian Andrzej Siewior2020-04-021-1/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since commit 5bbe3547aa3ba ("mm: allow compaction of unevictable pages") it is allowed to examine mlocked pages and compact them by default. On -RT even minor pagefaults are problematic because it may take a few 100us to resolve them and until then the task is blocked. Make compact_unevictable_allowed = 0 default and issue a warning on RT if it is changed. [bigeasy@linutronix.de: v5] Link: https://lore.kernel.org/linux-mm/20190710144138.qyn4tuttdq6h7kqx@linutronix.de/ Link: http://lkml.kernel.org/r/20200319165536.ovi75tsr2seared4@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Link: https://lore.kernel.org/linux-mm/20190710144138.qyn4tuttdq6h7kqx@linutronix.de/ Link: http://lkml.kernel.org/r/20200303202225.nhqc3v5gwlb7x6et@linutronix.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | | mm/compaction: really limit compact_unevictable_allowed to 0 and 1Sebastian Andrzej Siewior2020-04-021-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The proc file `compact_unevictable_allowed' should allow 0 and 1 only, the `extra*' attribues have been set properly but without proc_dointvec_minmax() as the `proc_handler' the limit will not be enforced. Use proc_dointvec_minmax() as the `proc_handler' to enfoce the valid specified range. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Link: http://lkml.kernel.org/r/20200303202054.gsosv7fsx2ma3cic@linutronix.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | | mm: memcontrol: recursive memory.low protectionJohannes Weiner2020-04-021-1/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Right now, the effective protection of any given cgroup is capped by its own explicit memory.low setting, regardless of what the parent says. The reasons for this are mostly historical and ease of implementation: to make delegation of memory.low safe, effective protection is the min() of all memory.low up the tree. Unfortunately, this limitation makes it impossible to protect an entire subtree from another without forcing the user to make explicit protection allocations all the way to the leaf cgroups - something that is highly undesirable in real life scenarios. Consider memory in a data center host. At the cgroup top level, we have a distinction between system management software and the actual workload the system is executing. Both branches are further subdivided into individual services, job components etc. We want to protect the workload as a whole from the system management software, but that doesn't mean we want to protect and prioritize individual workload wrt each other. Their memory demand can vary over time, and we'd want the VM to simply cache the hottest data within the workload subtree. Yet, the current memory.low limitations force us to allocate a fixed amount of protection to each workload component in order to get protection from system management software in general. This results in very inefficient resource distribution. Another concern with mandating downward allocation is that, as the complexity of the cgroup tree grows, it gets harder for the lower levels to be informed about decisions made at the host-level. Consider a container inside a namespace that in turn creates its own nested tree of cgroups to run multiple workloads. It'd be extremely difficult to configure memory.low parameters in those leaf cgroups that on one hand balance pressure among siblings as the container desires, while also reflecting the host-level protection from e.g. rpm upgrades, that lie beyond one or more delegation and namespacing points in the tree. It's highly unusual from a cgroup interface POV that nested levels have to be aware of and reflect decisions made at higher levels for them to be effective. To enable such use cases and scale configurability for complex trees, this patch implements a resource inheritance model for memory that is similar to how the CPU and the IO controller implement work-conserving resource allocations: a share of a resource allocated to a subree always applies to the entire subtree recursively, while allowing, but not mandating, children to further specify distribution rules. That means that if protection is explicitly allocated among siblings, those configured shares are being followed during page reclaim just like they are now. However, if the memory.low set at a higher level is not fully claimed by the children in that subtree, the "floating" remainder is applied to each cgroup in the tree in proportion to its size. Since reclaim pressure is applied in proportion to size as well, each child in that tree gets the same boost, and the effect is neutral among siblings - with respect to each other, they behave as if no memory control was enabled at all, and the VM simply balances the memory demands optimally within the subtree. But collectively those cgroups enjoy a boost over the cgroups in neighboring trees. E.g. a leaf cgroup with a memory.low setting of 0 no longer means that it's not getting a share of the hierarchically assigned resource, just that it doesn't claim a fixed amount of it to protect from its siblings. This allows us to recursively protect one subtree (workload) from another (system management), while letting subgroups compete freely among each other - without having to assign fixed shares to each leaf, and without nested groups having to echo higher-level settings. The floating protection composes naturally with fixed protection. Consider the following example tree: A A: low = 2G / \ A1: low = 1G A1 A2 A2: low = 0G As outside pressure is applied to this tree, A1 will enjoy a fixed protection from A2 of 1G, but the remaining, unclaimed 1G from A is split evenly among A1 and A2, coming out to 1.5G and 0.5G. There is a slight risk of regressing theoretical setups where the top-level cgroups don't know about the true budgeting and set bogusly high "bypass" values that are meaningfully allocated down the tree. Such setups would rely on unclaimed protection to be discarded, and distributing it would change the intended behavior. Be safe and hide the new behavior behind a mount option, 'memory_recursiveprot'. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Chris Down <chris@chrisdown.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Koutný <mkoutny@suse.com> Link: http://lkml.kernel.org/r/20200227195606.46212-4-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | | mm: kmem: rename memcg_kmem_(un)charge() into memcg_kmem_(un)charge_page()Roman Gushchin2020-04-021-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rename (__)memcg_kmem_(un)charge() into (__)memcg_kmem_(un)charge_page() to better reflect what they are actually doing: 1) call __memcg_kmem_(un)charge_memcg() to actually charge or uncharge the current memcg 2) set or clear the PageKmemcg flag Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Link: http://lkml.kernel.org/r/20200109202659.752357-4-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | | | Merge branch 'for-linus' of ↵Linus Torvalds2020-04-028-45/+39
|\ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull exec/proc updates from Eric Biederman: "This contains two significant pieces of work: the work to sort out proc_flush_task, and the work to solve a deadlock between strace and exec. Fixing proc_flush_task so that it no longer requires a persistent mount makes improvements to proc possible. The removal of the persistent mount solves an old regression that that caused the hidepid mount option to only work on remount not on mount. The regression was found and reported by the Android folks. This further allows Alexey Gladkov's work making proc mount options specific to an individual mount of proc to move forward. The work on exec starts solving a long standing issue with exec that it takes mutexes of blocking userspace applications, which makes exec extremely deadlock prone. For the moment this adds a second mutex with a narrower scope that handles all of the easy cases. Which makes the tricky cases easy to spot. With a little luck the code to solve those deadlocks will be ready by next merge window" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (25 commits) signal: Extend exec_id to 64bits pidfd: Use new infrastructure to fix deadlocks in execve perf: Use new infrastructure to fix deadlocks in execve proc: io_accounting: Use new infrastructure to fix deadlocks in execve proc: Use new infrastructure to fix deadlocks in execve kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve kernel: doc: remove outdated comment cred.c mm: docs: Fix a comment in process_vm_rw_core selftests/ptrace: add test cases for dead-locks exec: Fix a deadlock in strace exec: Add exec_update_mutex to replace cred_guard_mutex exec: Move exec_mmap right after de_thread in flush_old_exec exec: Move cleanup of posix timers on exec out of de_thread exec: Factor unshare_sighand out of de_thread and call it separately exec: Only compute current once in flush_old_exec pid: Improve the comment about waiting in zap_pid_ns_processes proc: Remove the now unnecessary internal mount of proc uml: Create a private mount of proc for mconsole uml: Don't consult current to find the proc_mnt in mconsole_proc proc: Use a list of inodes to flush from proc ...
| * | | | | | | signal: Extend exec_id to 64bitsEric W. Biederman2020-04-011-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Replace the 32bit exec_id with a 64bit exec_id to make it impossible to wrap the exec_id counter. With care an attacker can cause exec_id wrap and send arbitrary signals to a newly exec'd parent. This bypasses the signal sending checks if the parent changes their credentials during exec. The severity of this problem can been seen that in my limited testing of a 32bit exec_id it can take as little as 19s to exec 65536 times. Which means that it can take as little as 14 days to wrap a 32bit exec_id. Adam Zabrocki has succeeded wrapping the self_exe_id in 7 days. Even my slower timing is in the uptime of a typical server. Which means self_exec_id is simply a speed bump today, and if exec gets noticably faster self_exec_id won't even be a speed bump. Extending self_exec_id to 64bits introduces a problem on 32bit architectures where reading self_exec_id is no longer atomic and can take two read instructions. Which means that is is possible to hit a window where the read value of exec_id does not match the written value. So with very lucky timing after this change this still remains expoiltable. I have updated the update of exec_id on exec to use WRITE_ONCE and the read of exec_id in do_notify_parent to use READ_ONCE to make it clear that there is no locking between these two locations. Link: https://lore.kernel.org/kernel-hardening/20200324215049.GA3710@pi3.com.pl Fixes: 2.3.23pre2 Cc: stable@vger.kernel.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | | | | | pidfd: Use new infrastructure to fix deadlocks in execveBernd Edlinger2020-03-251-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This changes __pidfd_fget to use the new exec_update_mutex instead of cred_guard_mutex. This should be safe, as the credentials do not change before exec_update_mutex is locked. Therefore whatever file access is possible with holding the cred_guard_mutex here is also possbile with the exec_update_mutex. Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
| * | | | | | | perf: Use new infrastructure to fix deadlocks in execveBernd Edlinger2020-03-251-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This changes perf_event_set_clock to use the new exec_update_mutex instead of cred_guard_mutex. This should be safe, as the credentials are only used for reading. Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
| * | | | | | | kernel/kcmp.c: Use new infrastructure to fix deadlocks in execveBernd Edlinger2020-03-251-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This changes kcmp_epoll_target to use the new exec_update_mutex instead of cred_guard_mutex. This should be safe, as the credentials are only used for reading, and furthermore ->mm and ->sighand are updated on execve, but only under the new exec_update_mutex. Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
| * | | | | | | kernel: doc: remove outdated comment cred.cBernd Edlinger2020-03-251-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This removes an outdated comment in prepare_kernel_cred. There is no "cred_replace_mutex" any more, so the comment must go away. Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
| * | | | | | | exec: Fix a deadlock in straceBernd Edlinger2020-03-251-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This fixes a deadlock in the tracer when tracing a multi-threaded application that calls execve while more than one thread are running. I observed that when running strace on the gcc test suite, it always blocks after a while, when expect calls execve, because other threads have to be terminated. They send ptrace events, but the strace is no longer able to respond, since it is blocked in vm_access. The deadlock is always happening when strace needs to access the tracees process mmap, while another thread in the tracee starts to execve a child process, but that cannot continue until the PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received: strace D 0 30614 30584 0x00000000 Call Trace: __schedule+0x3ce/0x6e0 schedule+0x5c/0xd0 schedule_preempt_disabled+0x15/0x20 __mutex_lock.isra.13+0x1ec/0x520 __mutex_lock_killable_slowpath+0x13/0x20 mutex_lock_killable+0x28/0x30 mm_access+0x27/0xa0 process_vm_rw_core.isra.3+0xff/0x550 process_vm_rw+0xdd/0xf0 __x64_sys_process_vm_readv+0x31/0x40 do_syscall_64+0x64/0x220 entry_SYSCALL_64_after_hwframe+0x44/0xa9 expect D 0 31933 30876 0x80004003 Call Trace: __schedule+0x3ce/0x6e0 schedule+0x5c/0xd0 flush_old_exec+0xc4/0x770 load_elf_binary+0x35a/0x16c0 search_binary_handler+0x97/0x1d0 __do_execve_file.isra.40+0x5d4/0x8a0 __x64_sys_execve+0x49/0x60 do_syscall_64+0x64/0x220 entry_SYSCALL_64_after_hwframe+0x44/0xa9 This changes mm_access to use the new exec_update_mutex instead of cred_guard_mutex. This patch is based on the following patch by Eric W. Biederman: "[PATCH 0/5] Infrastructure to allow fixing exec deadlocks" Link: https://lore.kernel.org/lkml/87v9ne5y4y.fsf_-_@x220.int.ebiederm.org/ Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
| * | | | | | | exec: Add exec_update_mutex to replace cred_guard_mutexEric W. Biederman2020-03-251-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The cred_guard_mutex is problematic as it is held over possibly indefinite waits for userspace. The possible indefinite waits for userspace that I have identified are: The cred_guard_mutex is held in PTRACE_EVENT_EXIT waiting for the tracer. The cred_guard_mutex is held over "put_user(0, tsk->clear_child_tid)" in exit_mm(). The cred_guard_mutex is held over "get_user(futex_offset, ...") in exit_robust_list. The cred_guard_mutex held over copy_strings. The functions get_user and put_user can trigger a page fault which can potentially wait indefinitely in the case of userfaultfd or if userspace implements part of the page fault path. In any of those cases the userspace process that the kernel is waiting for might make a different system call that winds up taking the cred_guard_mutex and result in deadlock. Holding a mutex over any of those possibly indefinite waits for userspace does not appear necessary. Add exec_update_mutex that will just cover updating the process during exec where the permissions and the objects pointed to by the task struct may be out of sync. The plan is to switch the users of cred_guard_mutex to exec_update_mutex one by one. This lets us move forward while still being careful and not introducing any regressions. Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/ Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/ Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/ Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/ Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/ Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.") Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2") Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
| * | | | | | | pid: Improve the comment about waiting in zap_pid_ns_processesEric W. Biederman2020-02-281-12/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Oleg wrote a very informative comment, but with the removal of proc_cleanup_work it is no longer accurate. Rewrite the comment so that it only talks about the details that are still relevant, and hopefully is a little clearer. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | | | | | proc: Remove the now unnecessary internal mount of procEric W. Biederman2020-02-282-15/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There remains no more code in the kernel using pids_ns->proc_mnt, therefore remove it from the kernel. The big benefit of this change is that one of the most error prone and tricky parts of the pid namespace implementation, maintaining kernel mounts of proc is removed. In addition removing the unnecessary complexity of the kernel mount fixes a regression that caused the proc mount options to be ignored. Now that the initial mount of proc comes from userspace, those mount options are again honored. This fixes Android's usage of the proc hidepid option. Reported-by: Alistair Strachan <astrachan@google.com> Fixes: e94591d0d90c ("proc: Convert proc_mount to use mount_ns.") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | | | | | proc: Use a list of inodes to flush from procEric W. Biederman2020-02-242-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rework the flushing of proc to use a list of directory inodes that need to be flushed. The list is kept on struct pid not on struct task_struct, as there is a fixed connection between proc inodes and pids but at least for the case of de_thread the pid of a task_struct changes. This removes the dependency on proc_mnt which allows for different mounts of proc having different mount options even in the same pid namespace and this allows for the removal of proc_mnt which will trivially the first mount of proc to honor it's mount options. This flushing remains an optimization. The functions pid_delete_dentry and pid_revalidate ensure that ordinary dcache management will not attempt to use dentries past the point their respective task has died. When unused the shrinker will eventually be able to remove these dentries. There is a case in de_thread where proc_flush_pid can be called early for a given pid. Which winds up being safe (if suboptimal) as this is just an optiimization. Only pid directories are put on the list as the other per pid files are children of those directories and d_invalidate on the directory will get them as well. So that the pid can be used during flushing it's reference count is taken in release_task and dropped in proc_flush_pid. Further the call of proc_flush_pid is moved after the tasklist_lock is released in release_task so that it is certain that the pid has already been unhashed when flushing it taking place. This removes a small race where a dentry could recreated. As struct pid is supposed to be small and I need a per pid lock I reuse the only lock that currently exists in struct pid the the wait_pidfd.lock. The net result is that this adds all of this functionality with just a little extra list management overhead and a single extra pointer in struct pid. v2: Initialize pid->inodes. I somehow failed to get that initialization into the initial version of the patch. A boot failure was reported by "kernel test robot <lkp@intel.com>", and failure to initialize that pid->inodes matches all of the reported symptoms. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
* | | | | | | | Merge branch 'linus' of ↵Linus Torvalds2020-04-011-4/+5
|\ \ \ \ \ \ \ \ | |_|/ / / / / / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto updates from Herbert Xu: "API: - Fix out-of-sync IVs in self-test for IPsec AEAD algorithms Algorithms: - Use formally verified implementation of x86/curve25519 Drivers: - Enhance hwrng support in caam - Use crypto_engine for skcipher/aead/rsa/hash in caam - Add Xilinx AES driver - Add uacce driver - Register zip engine to uacce in hisilicon - Add support for OCTEON TX CPT engine in marvell" * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (162 commits) crypto: af_alg - bool type cosmetics crypto: arm[64]/poly1305 - add artifact to .gitignore files crypto: caam - limit single JD RNG output to maximum of 16 bytes crypto: caam - enable prediction resistance in HRWNG bus: fsl-mc: add api to retrieve mc version crypto: caam - invalidate entropy register during RNG initialization crypto: caam - check if RNG job failed crypto: caam - simplify RNG implementation crypto: caam - drop global context pointer and init_done crypto: caam - use struct hwrng's .init for initialization crypto: caam - allocate RNG instantiation descriptor with GFP_DMA crypto: ccree - remove duplicated include from cc_aead.c crypto: chelsio - remove set but not used variable 'adap' crypto: marvell - enable OcteonTX cpt options for build crypto: marvell - add the Virtual Function driver for CPT crypto: marvell - add support for OCTEON TX CPT engine crypto: marvell - create common Kconfig and Makefile for Marvell crypto: arm/neon - memzero_explicit aes-cbc key crypto: bcm - Use scnprintf() for avoiding potential buffer overflow crypto: atmel-i2c - Fix wakeup fail ...
| * | | | | | | crypto: pcrypt - simplify error handling in pcrypt_create_aead()Eric Biggers2020-03-061-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Simplify the error handling in pcrypt_create_aead() by taking advantage of crypto_grab_aead() now handling an ERR_PTR() name and by taking advantage of crypto_drop_aead() now accepting (as a no-op) a spawn that hasn't been grabbed yet. This required also making padata_free_shell() accept a NULL argument. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
| * | | | | | | padata: fix uninitialized return value in padata_replace()Daniel Jordan2020-02-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | According to Geert's report[0], kernel/padata.c: warning: 'err' may be used uninitialized in this function [-Wuninitialized]: => 539:2 Warning is seen only with older compilers on certain archs. The runtime effect is potentially returning garbage down the stack when padata's cpumasks are modified before any pcrypt requests have run. Simplest fix is to initialize err to the success value. [0] http://lkml.kernel.org/r/20200210135506.11536-1-geert@linux-m68k.org Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Fixes: bbefa1dd6a6d ("crypto: pcrypt - Avoid deadlock by using per-instance padata queues") Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: linux-crypto@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* | | | | | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-nextLinus Torvalds2020-04-0125-1076/+2676
|\ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking updates from David Miller: "Highlights: 1) Fix the iwlwifi regression, from Johannes Berg. 2) Support BSS coloring and 802.11 encapsulation offloading in hardware, from John Crispin. 3) Fix some potential Spectre issues in qtnfmac, from Sergey Matyukevich. 4) Add TTL decrement action to openvswitch, from Matteo Croce. 5) Allow paralleization through flow_action setup by not taking the RTNL mutex, from Vlad Buslov. 6) A lot of zero-length array to flexible-array conversions, from Gustavo A. R. Silva. 7) Align XDP statistics names across several drivers for consistency, from Lorenzo Bianconi. 8) Add various pieces of infrastructure for offloading conntrack, and make use of it in mlx5 driver, from Paul Blakey. 9) Allow using listening sockets in BPF sockmap, from Jakub Sitnicki. 10) Lots of parallelization improvements during configuration changes in mlxsw driver, from Ido Schimmel. 11) Add support to devlink for generic packet traps, which report packets dropped during ACL processing. And use them in mlxsw driver. From Jiri Pirko. 12) Support bcmgenet on ACPI, from Jeremy Linton. 13) Make BPF compatible with RT, from Thomas Gleixnet, Alexei Starovoitov, and your's truly. 14) Support XDP meta-data in virtio_net, from Yuya Kusakabe. 15) Fix sysfs permissions when network devices change namespaces, from Christian Brauner. 16) Add a flags element to ethtool_ops so that drivers can more simply indicate which coalescing parameters they actually support, and therefore the generic layer can validate the user's ethtool request. Use this in all drivers, from Jakub Kicinski. 17) Offload FIFO qdisc in mlxsw, from Petr Machata. 18) Support UDP sockets in sockmap, from Lorenz Bauer. 19) Fix stretch ACK bugs in several TCP congestion control modules, from Pengcheng Yang. 20) Support virtual functiosn in octeontx2 driver, from Tomasz Duszynski. 21) Add region operations for devlink and use it in ice driver to dump NVM contents, from Jacob Keller. 22) Add support for hw offload of MACSEC, from Antoine Tenart. 23) Add support for BPF programs that can be attached to LSM hooks, from KP Singh. 24) Support for multiple paths, path managers, and counters in MPTCP. From Peter Krystad, Paolo Abeni, Florian Westphal, Davide Caratti, and others. 25) More progress on adding the netlink interface to ethtool, from Michal Kubecek" * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2121 commits) net: ipv6: rpl_iptunnel: Fix potential memory leak in rpl_do_srh_inline cxgb4/chcr: nic-tls stats in ethtool net: dsa: fix oops while probing Marvell DSA switches net/bpfilter: remove superfluous testing message net: macb: Fix handling of fixed-link node net: dsa: ksz: Select KSZ protocol tag netdevsim: dev: Fix memory leak in nsim_dev_take_snapshot_write net: stmmac: add EHL 2.5Gbps PCI info and PCI ID net: stmmac: add EHL PSE0 & PSE1 1Gbps PCI info and PCI ID net: stmmac: create dwmac-intel.c to contain all Intel platform net: dsa: bcm_sf2: Support specifying VLAN tag egress rule net: dsa: bcm_sf2: Add support for matching VLAN TCI net: dsa: bcm_sf2: Move writing of CFP_DATA(5) into slicing functions net: dsa: bcm_sf2: Check earlier for FLOW_EXT and FLOW_MAC_EXT net: dsa: bcm_sf2: Disable learning for ASP port net: dsa: b53: Deny enslaving port 7 for 7278 into a bridge net: dsa: b53: Prevent tagged VLAN on port 7 for 7278 net: dsa: b53: Restore VLAN entries upon (re)configuration net: dsa: bcm_sf2: Fix overflow checks hv_netvsc: Remove unnecessary round_up for recv_completion_cnt ...
| * \ \ \ \ \ \ \ Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller2020-03-3113-748/+1872
| |\ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | | | | | | bpf: Implement bpf_prog replacement for an active bpf_cgroup_linkAndrii Nakryiko2020-03-313-0/+162
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add new operation (LINK_UPDATE), which allows to replace active bpf_prog from under given bpf_link. Currently this is only supported for bpf_cgroup_link, but will be extended to other kinds of bpf_links in follow-up patches. For bpf_cgroup_link, implemented functionality matches existing semantics for direct bpf_prog attachment (including BPF_F_REPLACE flag). User can either unconditionally set new bpf_prog regardless of which bpf_prog is currently active under given bpf_link, or, optionally, can specify expected active bpf_prog. If active bpf_prog doesn't match expected one, no changes are performed, old bpf_link stays intact and attached, operation returns a failure. cgroup_bpf_replace() operation is resolving race between auto-detachment and bpf_prog update in the same fashion as it's done for bpf_link detachment, except in this case update has no way of succeeding because of target cgroup marked as dying. So in this case error is returned. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200330030001.2312810-3-andriin@fb.com
| | * | | | | | | | bpf: Implement bpf_link-based cgroup BPF program attachmentAndrii Nakryiko2020-03-313-90/+303
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement new sub-command to attach cgroup BPF programs and return FD-based bpf_link back on success. bpf_link, once attached to cgroup, cannot be replaced, except by owner having its FD. Cgroup bpf_link supports only BPF_F_ALLOW_MULTI semantics. Both link-based and prog-based BPF_F_ALLOW_MULTI attachments can be freely intermixed. To prevent bpf_cgroup_link from keeping cgroup alive past the point when no BPF program can be executed, implement auto-detachment of link. When cgroup_bpf_release() is called, all attached bpf_links are forced to release cgroup refcounts, but they leave bpf_link otherwise active and allocated, as well as still owning underlying bpf_prog. This is because user-space might still have FDs open and active, so bpf_link as a user-referenced object can't be freed yet. Once last active FD is closed, bpf_link will be freed and underlying bpf_prog refcount will be dropped. But cgroup refcount won't be touched, because cgroup is released already. The inherent race between bpf_cgroup_link release (from closing last FD) and cgroup_bpf_release() is resolved by both operations taking cgroup_mutex. So the only additional check required is when bpf_cgroup_link attempts to detach itself from cgroup. At that time we need to check whether there is still cgroup associated with that link. And if not, exit with success, because bpf_cgroup_link was already successfully detached. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Roman Gushchin <guro@fb.com> Link: https://lore.kernel.org/bpf/20200330030001.2312810-2-andriin@fb.com
| | * | | | | | | | bpf: Verifier, refine 32bit bound in do_refine_retval_rangeJohn Fastabend2020-03-311-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Further refine return values range in do_refine_retval_range by noting these are int return types (We will assume here that int is a 32-bit type). Two reasons to pull this out of original patch. First it makes the original fix impossible to backport. And second I've not seen this as being problematic in practice unlike the other case. Fixes: 849fa50662fbc ("bpf/verifier: refine retval R0 state for bpf_get_stack helper") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/158560421952.10843.12496354931526965046.stgit@john-Precision-5820-Tower
| | * | | | | | | | bpf: Verifier, do explicit ALU32 bounds trackingJohn Fastabend2020-03-302-281/+852
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is not possible for the current verifier to track ALU32 and JMP ops correctly. This can result in the verifier aborting with errors even though the program should be verifiable. BPF codes that hit this can work around it by changin int variables to 64-bit types, marking variables volatile, etc. But this is all very ugly so it would be better to avoid these tricks. But, the main reason to address this now is do_refine_retval_range() was assuming return values could not be negative. Once we fixed this code that was previously working will no longer work. See do_refine_retval_range() patch for details. And we don't want to suddenly cause programs that used to work to fail. The simplest example code snippet that illustrates the problem is likely this, 53: w8 = w0 // r8 <- [0, S32_MAX], // w8 <- [-S32_MIN, X] 54: w8 <s 0 // r8 <- [0, U32_MAX] // w8 <- [0, X] The expected 64-bit and 32-bit bounds after each line are shown on the right. The current issue is without the w* bounds we are forced to use the worst case bound of [0, U32_MAX]. To resolve this type of case, jmp32 creating divergent 32-bit bounds from 64-bit bounds, we add explicit 32-bit register bounds s32_{min|max}_value and u32_{min|max}_value. Then from branch_taken logic creating new bounds we can track 32-bit bounds explicitly. The next case we observed is ALU ops after the jmp32, 53: w8 = w0 // r8 <- [0, S32_MAX], // w8 <- [-S32_MIN, X] 54: w8 <s 0 // r8 <- [0, U32_MAX] // w8 <- [0, X] 55: w8 += 1 // r8 <- [0, U32_MAX+1] // w8 <- [0, X+1] In order to keep the bounds accurate at this point we also need to track ALU32 ops. To do this we add explicit ALU32 logic for each of the ALU ops, mov, add, sub, etc. Finally there is a question of how and when to merge bounds. The cases enumerate here, 1. MOV ALU32 - zext 32-bit -> 64-bit 2. MOV ALU64 - copy 64-bit -> 32-bit 3. op ALU32 - zext 32-bit -> 64-bit 4. op ALU64 - n/a 5. jmp ALU32 - 64-bit: var32_off | upper_32_bits(var64_off) 6. jmp ALU64 - 32-bit: (>> (<< var64_off)) Details for each case, For "MOV ALU32" BPF arch zero extends so we simply copy the bounds from 32-bit into 64-bit ensuring we truncate var_off and 64-bit bounds correctly. See zext_32_to_64. For "MOV ALU64" copy all bounds including 32-bit into new register. If the src register had 32-bit bounds the dst register will as well. For "op ALU32" zero extend 32-bit into 64-bit the same as move, see zext_32_to_64. For "op ALU64" calculate both 32-bit and 64-bit bounds no merging is done here. Except we have a special case. When RSH or ARSH is done we can't simply ignore shifting bits from 64-bit reg into the 32-bit subreg. So currently just push bounds from 64-bit into 32-bit. This will be correct in the sense that they will represent a valid state of the register. However we could lose some accuracy if an ARSH is following a jmp32 operation. We can handle this special case in a follow up series. For "jmp ALU32" mark 64-bit reg unknown and recalculate 64-bit bounds from tnum by setting var_off to ((<<(>>var_off)) | var32_off). We special case if 64-bit bounds has zero'd upper 32bits at which point we can simply copy 32-bit bounds into 64-bit register. This catches a common compiler trick where upper 32-bits are zeroed and then 32-bit ops are used followed by a 64-bit compare or 64-bit op on a pointer. See __reg_combine_64_into_32(). For "jmp ALU64" cast the bounds of the 64bit to their 32-bit counterpart. For example s32_min_value = (s32)reg->smin_value. For tnum use only the lower 32bits via, (>>(<<var_off)). See __reg_combine_64_into_32(). Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/158560419880.10843.11448220440809118343.stgit@john-Precision-5820-Tower
| | * | | | | | | | bpf: Verifier, do_refine_retval_range may clamp umin to 0 incorrectlyJohn Fastabend2020-03-301-8/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | do_refine_retval_range() is called to refine return values from specified helpers, probe_read_str and get_stack at the moment, the reasoning is because both have a max value as part of their input arguments and because the helper ensure the return value will not be larger than this we can set smax values of the return register, r0. However, the return value is a signed integer so setting umax is incorrect It leads to further confusion when the do_refine_retval_range() then calls, __reg_deduce_bounds() which will see a umax value as meaning the value is unsigned and then assuming it is unsigned set the smin = umin which in this case results in 'smin = 0' and an 'smax = X' where X is the input argument from the helper call. Here are the comments from _reg_deduce_bounds() on why this would be safe to do. /* Learn sign from unsigned bounds. Signed bounds cross the sign * boundary, so we must be careful. */ if ((s64)reg->umax_value >= 0) { /* Positive. We can't learn anything from the smin, but smax * is positive, hence safe. */ reg->smin_value = reg->umin_value; reg->smax_value = reg->umax_value = min_t(u64, reg->smax_value, reg->umax_value); But now we incorrectly have a return value with type int with the signed bounds (0,X). Suppose the return value is negative, which is possible the we have the verifier and reality out of sync. Among other things this may result in any error handling code being falsely detected as dead-code and removed. For instance the example below shows using bpf_probe_read_str() causes the error path to be identified as dead code and removed. >From the 'llvm-object -S' dump, r2 = 100 call 45 if r0 s< 0 goto +4 r4 = *(u32 *)(r7 + 0) But from dump xlate (b7) r2 = 100 (85) call bpf_probe_read_compat_str#-96768 (61) r4 = *(u32 *)(r7 +0) <-- dropped if goto Due to verifier state after call being R0=inv(id=0,umax_value=100,var_off=(0x0; 0x7f)) To fix omit setting the umax value because its not safe. The only actual bounds we know is the smax. This results in the correct bounds (SMIN, X) where X is the max length from the helper. After this the new verifier state looks like the following after call 45. R0=inv(id=0,smax_value=100) Then xlated version no longer removed dead code giving the expected result, (b7) r2 = 100 (85) call bpf_probe_read_compat_str#-96768 (c5) if r0 s< 0x0 goto pc+4 (61) r4 = *(u32 *)(r7 +0) Note, bpf_probe_read_* calls are root only so we wont hit this case with non-root bpf users. v3: comment had some documentation about meta set to null case which is not relevant here and confusing to include in the comment. v2 note: In original version we set msize_smax_value from check_func_arg() and propagated this into smax of retval. The logic was smax is the bound on the retval we set and because the type in the helper is ARG_CONST_SIZE we know that the reg is a positive tnum_const() so umax=smax. Alexei pointed out though this is a bit odd to read because the register in check_func_arg() has a C type of u32 and the umax bound would be the normally relavent bound here. Pulling in extra knowledge about future checks makes reading the code a bit tricky. Further having a signed meta data that can only ever be positive is also a bit odd. So dropped the msize_smax_value metadata and made it a u64 msize_max_value to indicate its unsigned. And additionally save bound from umax value in check_arg_funcs which is the same as smax due to as noted above tnumx_cont and negative check but reads better. By my analysis nothing functionally changes in v2 but it does get easier to read so that is win. Fixes: 849fa50662fbc ("bpf/verifier: refine retval R0 state for bpf_get_stack helper") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/158560417900.10843.14351995140624628941.stgit@john-Precision-5820-Tower
| | * | | | | | | | bpf: btf: Fix arg verification in btf_ctx_access()KP Singh2020-03-301-7/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The bounds checking for the arguments accessed in the BPF program breaks when the expected_attach_type is not BPF_TRACE_FEXIT, BPF_LSM_MAC or BPF_MODIFY_RETURN resulting in no check being done for the default case (the programs which do not receive the return value of the attached function in its arguments) when the index of the argument being accessed is equal to the number of arguments (nr_args). This was a result of a misplaced "else if" block introduced by the Commit 6ba43b761c41 ("bpf: Attachment verification for BPF_MODIFY_RETURN") Fixes: 6ba43b761c41 ("bpf: Attachment verification for BPF_MODIFY_RETURN") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: KP Singh <kpsingh@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200330144246.338-1-kpsingh@chromium.org