summaryrefslogtreecommitdiffstats
path: root/kernel (follow)
Commit message (Collapse)AuthorAgeFilesLines
* Revert BPF token-related functionalityAndrii Nakryiko2023-12-1911-767/+94
| | | | | | | | | | | | This patch includes the following revert (one conflicting BPF FS patch and three token patch sets, represented by merge commits): - revert 0f5d5454c723 "Merge branch 'bpf-fs-mount-options-parsing-follow-ups'"; - revert 750e785796bb "bpf: Support uid and gid when mounting bpffs"; - revert 733763285acf "Merge branch 'bpf-token-support-in-libbpf-s-bpf-object'"; - revert c35919dcce28 "Merge branch 'bpf-token-and-bpf-fs-based-delegation'". Link: https://lore.kernel.org/bpf/CAHk-=wg7JuFYwGy=GOMbRCtOL+jwSQsdUaBsRWkDVYbxipbM5A@mail.gmail.com Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
* Merge tag 'for-netdev' of ↵Jakub Kicinski2023-12-1922-458/+1497
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Alexei Starovoitov says: ==================== pull-request: bpf-next 2023-12-18 This PR is larger than usual and contains changes in various parts of the kernel. The main changes are: 1) Fix kCFI bugs in BPF, from Peter Zijlstra. End result: all forms of indirect calls from BPF into kernel and from kernel into BPF work with CFI enabled. This allows BPF to work with CONFIG_FINEIBT=y. 2) Introduce BPF token object, from Andrii Nakryiko. It adds an ability to delegate a subset of BPF features from privileged daemon (e.g., systemd) through special mount options for userns-bound BPF FS to a trusted unprivileged application. The design accommodates suggestions from Christian Brauner and Paul Moore. Example: $ sudo mkdir -p /sys/fs/bpf/token $ sudo mount -t bpf bpffs /sys/fs/bpf/token \ -o delegate_cmds=prog_load:MAP_CREATE \ -o delegate_progs=kprobe \ -o delegate_attachs=xdp 3) Various verifier improvements and fixes, from Andrii Nakryiko, Andrei Matei. - Complete precision tracking support for register spills - Fix verification of possibly-zero-sized stack accesses - Fix access to uninit stack slots - Track aligned STACK_ZERO cases as imprecise spilled registers. It improves the verifier "instructions processed" metric from single digit to 50-60% for some programs. - Fix verifier retval logic 4) Support for VLAN tag in XDP hints, from Larysa Zaremba. 5) Allocate BPF trampoline via bpf_prog_pack mechanism, from Song Liu. End result: better memory utilization and lower I$ miss for calls to BPF via BPF trampoline. 6) Fix race between BPF prog accessing inner map and parallel delete, from Hou Tao. 7) Add bpf_xdp_get_xfrm_state() kfunc, from Daniel Xu. It allows BPF interact with IPSEC infra. The intent is to support software RSS (via XDP) for the upcoming ipsec pcpu work. Experiments on AWS demonstrate single tunnel pcpu ipsec reaching line rate on 100G ENA nics. 8) Expand bpf_cgrp_storage to support cgroup1 non-attach, from Yafang Shao. 9) BPF file verification via fsverity, from Song Liu. It allows BPF progs get fsverity digest. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (164 commits) bpf: Ensure precise is reset to false in __mark_reg_const_zero() selftests/bpf: Add more uprobe multi fail tests bpf: Fail uprobe multi link with negative offset selftests/bpf: Test the release of map btf s390/bpf: Fix indirect trampoline generation selftests/bpf: Temporarily disable dummy_struct_ops test on s390 x86/cfi,bpf: Fix bpf_exception_cb() signature bpf: Fix dtor CFI cfi: Add CFI_NOSEAL() x86/cfi,bpf: Fix bpf_struct_ops CFI x86/cfi,bpf: Fix bpf_callback_t CFI x86/cfi,bpf: Fix BPF JIT call cfi: Flip headers selftests/bpf: Add test for abnormal cnt during multi-kprobe attachment selftests/bpf: Don't use libbpf_get_error() in kprobe_multi_test selftests/bpf: Add test for abnormal cnt during multi-uprobe attachment bpf: Limit the number of kprobes when attaching program to multiple kprobes bpf: Limit the number of uprobes when attaching program to multiple uprobes bpf: xdp: Register generic_kfunc_set with XDP programs selftests/bpf: utilize string values for delegate_xxx mount options ... ==================== Link: https://lore.kernel.org/r/20231219000520.34178-1-alexei.starovoitov@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * bpf: Ensure precise is reset to false in __mark_reg_const_zero()Andrii Nakryiko2023-12-181-18/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is safe to always start with imprecise SCALAR_VALUE register. Previously __mark_reg_const_zero() relied on caller to reset precise mark, but it's very error prone and we already missed it in a few places. So instead make __mark_reg_const_zero() reset precision always, as it's a safe default for SCALAR_VALUE. Explanation is basically the same as for why we are resetting (or rather not setting) precision in current state. If necessary, precision propagation will set it to precise correctly. As such, also remove a big comment about forward precision propagation in mark_reg_stack_read() and avoid unnecessarily setting precision to true after reading from STACK_ZERO stack. Again, precision propagation will correctly handle this, if that SCALAR_VALUE register will ever be needed to be precise. Reported-by: Maxim Mikityanskiy <maxtram95@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Maxim Mikityanskiy <maxtram95@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20231218173601.53047-1-andrii@kernel.org
| * bpf: Fail uprobe multi link with negative offsetJiri Olsa2023-12-181-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently the __uprobe_register will return 0 (success) when called with negative offset. The reason is that the call to register_for_each_vma and then build_map_info won't return error for negative offset. They just won't do anything - no matching vma is found so there's no registered breakpoint for the uprobe. I don't think we can change the behaviour of __uprobe_register and fail for negative uprobe offset, because apps might depend on that already. But I think we can still make the change and check for it on bpf multi link syscall level. Also moving the __get_user call and check for the offsets to the top of loop, to fail early without extra __get_user calls for ref_ctr_offset and cookie arrays. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/bpf/20231217215538.3361991-2-jolsa@kernel.org
| * x86/cfi,bpf: Fix bpf_exception_cb() signatureAlexei Starovoitov2023-12-161-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | As per the earlier patches, BPF sub-programs have bpf_callback_t signature and CFI expects callers to have matching signature. This is violated by bpf_prog_aux::bpf_exception_cb(). [peterz: Changelog] Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/CAADnVQ+Z7UcXXBBhMubhcMM=R-dExk-uHtfOLtoLxQ1XxEpqEA@mail.gmail.com Link: https://lore.kernel.org/r/20231215092707.910319166@infradead.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: Fix dtor CFIPeter Zijlstra2023-12-162-3/+21
| | | | | | | | | | | | | | | | | | | | | | Ensure the various dtor functions match their prototype and retain their CFI signatures, since they don't have their address taken, they are prone to not getting CFI, making them impossible to call indirectly. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20231215092707.799451071@infradead.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * x86/cfi,bpf: Fix bpf_struct_ops CFIPeter Zijlstra2023-12-161-8/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | BPF struct_ops uses __arch_prepare_bpf_trampoline() to write trampolines for indirect function calls. These tramplines much have matching CFI. In order to obtain the correct CFI hash for the various methods, add a matching structure that contains stub functions, the compiler will generate correct CFI which we can pilfer for the trampolines. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20231215092707.566977112@infradead.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * x86/cfi,bpf: Fix BPF JIT callPeter Zijlstra2023-12-161-0/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current BPF call convention is __nocfi, except when it calls !JIT things, then it calls regular C functions. It so happens that with FineIBT the __nocfi and C calling conventions are incompatible. Specifically __nocfi will call at func+0, while FineIBT will have endbr-poison there, which is not a valid indirect target. Causing #CP. Notably this only triggers on IBT enabled hardware, which is probably why this hasn't been reported (also, most people will have JIT on anyway). Implement proper CFI prologues for the BPF JIT codegen and drop __nocfi for x86. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20231215092707.345270396@infradead.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: Limit the number of kprobes when attaching program to multiple kprobesHou Tao2023-12-151-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | An abnormally big cnt may also be assigned to kprobe_multi.cnt when attaching multiple kprobes. It will trigger the following warning in kvmalloc_node(): if (unlikely(size > INT_MAX)) { WARN_ON_ONCE(!(flags & __GFP_NOWARN)); return NULL; } Fix the warning by limiting the maximal number of kprobes in bpf_kprobe_multi_link_attach(). If the number of kprobes is greater than MAX_KPROBE_MULTI_CNT, the attachment will fail and return -E2BIG. Fixes: 0dcac2725406 ("bpf: Add multi kprobe link") Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20231215100708.2265609-3-houtao@huaweicloud.com
| * bpf: Limit the number of uprobes when attaching program to multiple uprobesHou Tao2023-12-151-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | An abnormally big cnt may be passed to link_create.uprobe_multi.cnt, and it will trigger the following warning in kvmalloc_node(): if (unlikely(size > INT_MAX)) { WARN_ON_ONCE(!(flags & __GFP_NOWARN)); return NULL; } Fix the warning by limiting the maximal number of uprobes in bpf_uprobe_multi_link_attach(). If the number of uprobes is greater than MAX_UPROBE_MULTI_CNT, the attachment will return -E2BIG. Fixes: 89ae89f53d20 ("bpf: Add multi uprobe link") Reported-by: Xingwei Lee <xrivendell7@gmail.com> Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Closes: https://lore.kernel.org/bpf/CABOYnLwwJY=yFAGie59LFsUsBAgHfroVqbzZ5edAXbFE3YiNVA@mail.gmail.com Link: https://lore.kernel.org/bpf/20231215100708.2265609-2-houtao@huaweicloud.com
| * bpf: xdp: Register generic_kfunc_set with XDP programsDaniel Xu2023-12-151-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Registering generic_kfunc_set with XDP programs enables some of the newer BPF features inside XDP -- namely tree based data structures and BPF exceptions. The current motivation for this commit is to enable assertions inside XDP bpf progs. Assertions are a standard and useful tool to encode intent. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/d07d4614b81ca6aada44fcb89bb6b618fb66e4ca.1702594357.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: support symbolic BPF FS delegation mount optionsAndrii Nakryiko2023-12-151-38/+211
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Besides already supported special "any" value and hex bit mask, support string-based parsing of delegation masks based on exact enumerator names. Utilize BTF information of `enum bpf_cmd`, `enum bpf_map_type`, `enum bpf_prog_type`, and `enum bpf_attach_type` types to find supported symbolic names (ignoring __MAX_xxx guard values and stripping repetitive prefixes like BPF_ for cmd and attach types, BPF_MAP_TYPE_ for maps, and BPF_PROG_TYPE_ for prog types). The case doesn't matter, but it is normalized to lower case in mount option output. So "PROG_LOAD", "prog_load", and "MAP_create" are all valid values to specify for delegate_cmds options, "array" is among supported for map types, etc. Besides supporting string values, we also support multiple values specified at the same time, using colon (':') separator. There are corresponding changes on bpf_show_options side to use known values to print them in human-readable format, falling back to hex mask printing, if there are any unrecognized bits. This shouldn't be necessary when enum BTF information is present, but in general we should always be able to fall back to this even if kernel was built without BTF. As mentioned, emitted symbolic names are normalized to be all lower case. Example below shows various ways to specify delegate_cmds options through mount command and how mount options are printed back: 12/14 14:39:07.604 vmuser@archvm:~/local/linux/tools/testing/selftests/bpf $ mount | rg token $ sudo mkdir -p /sys/fs/bpf/token $ sudo mount -t bpf bpffs /sys/fs/bpf/token \ -o delegate_cmds=prog_load:MAP_CREATE \ -o delegate_progs=kprobe \ -o delegate_attachs=xdp $ mount | grep token bpffs on /sys/fs/bpf/token type bpf (rw,relatime,delegate_cmds=map_create:prog_load,delegate_progs=kprobe,delegate_attachs=xdp) Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231214225016.1209867-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: Fix a race condition between btf_put() and map_free()Yonghong Song2023-12-151-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When running `./test_progs -j` in my local vm with latest kernel, I once hit a kasan error like below: [ 1887.184724] BUG: KASAN: slab-use-after-free in bpf_rb_root_free+0x1f8/0x2b0 [ 1887.185599] Read of size 4 at addr ffff888106806910 by task kworker/u12:2/2830 [ 1887.186498] [ 1887.186712] CPU: 3 PID: 2830 Comm: kworker/u12:2 Tainted: G OEL 6.7.0-rc3-00699-g90679706d486-dirty #494 [ 1887.188034] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [ 1887.189618] Workqueue: events_unbound bpf_map_free_deferred [ 1887.190341] Call Trace: [ 1887.190666] <TASK> [ 1887.190949] dump_stack_lvl+0xac/0xe0 [ 1887.191423] ? nf_tcp_handle_invalid+0x1b0/0x1b0 [ 1887.192019] ? panic+0x3c0/0x3c0 [ 1887.192449] print_report+0x14f/0x720 [ 1887.192930] ? preempt_count_sub+0x1c/0xd0 [ 1887.193459] ? __virt_addr_valid+0xac/0x120 [ 1887.194004] ? bpf_rb_root_free+0x1f8/0x2b0 [ 1887.194572] kasan_report+0xc3/0x100 [ 1887.195085] ? bpf_rb_root_free+0x1f8/0x2b0 [ 1887.195668] bpf_rb_root_free+0x1f8/0x2b0 [ 1887.196183] ? __bpf_obj_drop_impl+0xb0/0xb0 [ 1887.196736] ? preempt_count_sub+0x1c/0xd0 [ 1887.197270] ? preempt_count_sub+0x1c/0xd0 [ 1887.197802] ? _raw_spin_unlock+0x1f/0x40 [ 1887.198319] bpf_obj_free_fields+0x1d4/0x260 [ 1887.198883] array_map_free+0x1a3/0x260 [ 1887.199380] bpf_map_free_deferred+0x7b/0xe0 [ 1887.199943] process_scheduled_works+0x3a2/0x6c0 [ 1887.200549] worker_thread+0x633/0x890 [ 1887.201047] ? __kthread_parkme+0xd7/0xf0 [ 1887.201574] ? kthread+0x102/0x1d0 [ 1887.202020] kthread+0x1ab/0x1d0 [ 1887.202447] ? pr_cont_work+0x270/0x270 [ 1887.202954] ? kthread_blkcg+0x50/0x50 [ 1887.203444] ret_from_fork+0x34/0x50 [ 1887.203914] ? kthread_blkcg+0x50/0x50 [ 1887.204397] ret_from_fork_asm+0x11/0x20 [ 1887.204913] </TASK> [ 1887.204913] </TASK> [ 1887.205209] [ 1887.205416] Allocated by task 2197: [ 1887.205881] kasan_set_track+0x3f/0x60 [ 1887.206366] __kasan_kmalloc+0x6e/0x80 [ 1887.206856] __kmalloc+0xac/0x1a0 [ 1887.207293] btf_parse_fields+0xa15/0x1480 [ 1887.207836] btf_parse_struct_metas+0x566/0x670 [ 1887.208387] btf_new_fd+0x294/0x4d0 [ 1887.208851] __sys_bpf+0x4ba/0x600 [ 1887.209292] __x64_sys_bpf+0x41/0x50 [ 1887.209762] do_syscall_64+0x4c/0xf0 [ 1887.210222] entry_SYSCALL_64_after_hwframe+0x63/0x6b [ 1887.210868] [ 1887.211074] Freed by task 36: [ 1887.211460] kasan_set_track+0x3f/0x60 [ 1887.211951] kasan_save_free_info+0x28/0x40 [ 1887.212485] ____kasan_slab_free+0x101/0x180 [ 1887.213027] __kmem_cache_free+0xe4/0x210 [ 1887.213514] btf_free+0x5b/0x130 [ 1887.213918] rcu_core+0x638/0xcc0 [ 1887.214347] __do_softirq+0x114/0x37e The error happens at bpf_rb_root_free+0x1f8/0x2b0: 00000000000034c0 <bpf_rb_root_free>: ; { 34c0: f3 0f 1e fa endbr64 34c4: e8 00 00 00 00 callq 0x34c9 <bpf_rb_root_free+0x9> 34c9: 55 pushq %rbp 34ca: 48 89 e5 movq %rsp, %rbp ... ; if (rec && rec->refcount_off >= 0 && 36aa: 4d 85 ed testq %r13, %r13 36ad: 74 a9 je 0x3658 <bpf_rb_root_free+0x198> 36af: 49 8d 7d 10 leaq 0x10(%r13), %rdi 36b3: e8 00 00 00 00 callq 0x36b8 <bpf_rb_root_free+0x1f8> <==== kasan function 36b8: 45 8b 7d 10 movl 0x10(%r13), %r15d <==== use-after-free load 36bc: 45 85 ff testl %r15d, %r15d 36bf: 78 8c js 0x364d <bpf_rb_root_free+0x18d> So the problem is at rec->refcount_off in the above. I did some source code analysis and find the reason. CPU A CPU B bpf_map_put: ... btf_put with rcu callback ... bpf_map_free_deferred with system_unbound_wq ... ... ... ... btf_free_rcu: ... ... ... bpf_map_free_deferred: ... ... ... ---------> btf_struct_metas_free() ... | race condition ... ... ---------> map->ops->map_free() ... ... btf->struct_meta_tab = NULL In the above, map_free() corresponds to array_map_free() and eventually calling bpf_rb_root_free() which calls: ... __bpf_obj_drop_impl(obj, field->graph_root.value_rec, false); ... Here, 'value_rec' is assigned in btf_check_and_fixup_fields() with following code: meta = btf_find_struct_meta(btf, btf_id); if (!meta) return -EFAULT; rec->fields[i].graph_root.value_rec = meta->record; So basically, 'value_rec' is a pointer to the record in struct_metas_tab. And it is possible that that particular record has been freed by btf_struct_metas_free() and hence we have a kasan error here. Actually it is very hard to reproduce the failure with current bpf/bpf-next code, I only got the above error once. To increase reproducibility, I added a delay in bpf_map_free_deferred() to delay map->ops->map_free(), which significantly increased reproducibility. diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 5e43ddd1b83f..aae5b5213e93 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -695,6 +695,7 @@ static void bpf_map_free_deferred(struct work_struct *work) struct bpf_map *map = container_of(work, struct bpf_map, work); struct btf_record *rec = map->record; + mdelay(100); security_bpf_map_free(map); bpf_map_release_memcg(map); /* implementation dependent freeing */ Hao also provided test cases ([1]) for easily reproducing the above issue. There are two ways to fix the issue, the v1 of the patch ([2]) moving btf_put() after map_free callback, and the v5 of the patch ([3]) using a kptr style fix which tries to get a btf reference during map_check_btf(). Each approach has its pro and cons. The first approach delays freeing btf while the second approach needs to acquire reference depending on context which makes logic not very elegant and may complicate things with future new data structures. Alexei suggested in [4] going back to v1 which is what this patch tries to do. Rerun './test_progs -j' with the above mdelay() hack for a couple of times and didn't observe the error for the above rb_root test cases. Running Hou's test ([1]) is also successful. [1] https://lore.kernel.org/bpf/20231207141500.917136-1-houtao@huaweicloud.com/ [2] v1: https://lore.kernel.org/bpf/20231204173946.3066377-1-yonghong.song@linux.dev/ [3] v5: https://lore.kernel.org/bpf/20231208041621.2968241-1-yonghong.song@linux.dev/ [4] v4: https://lore.kernel.org/bpf/CAADnVQJ3FiXUhZJwX_81sjZvSYYKCFB3BT6P8D59RS2Gu+0Z7g@mail.gmail.com/ Cc: Hou Tao <houtao@huaweicloud.com> Fixes: 958cf2e273f0 ("bpf: Introduce bpf_obj_new") Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20231214203815.1469107-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: Use GFP_KERNEL in bpf_event_entry_gen()Hou Tao2023-12-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | rcu_read_lock() is no longer held when invoking bpf_event_entry_gen() which is called by perf_event_fd_array_get_ptr(), so using GFP_KERNEL instead of GFP_ATOMIC to reduce the possibility of failures due to out-of-memory. Acked-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231214043010.3458072-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: Reduce the scope of rcu_read_lock when updating fd mapHou Tao2023-12-142-4/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is no rcu-read-lock requirement for ops->map_fd_get_ptr() or ops->map_fd_put_ptr(), so doesn't use rcu-read-lock for these two callbacks. For bpf_fd_array_map_update_elem(), accessing array->ptrs doesn't need rcu-read-lock because array->ptrs must still be allocated. For bpf_fd_htab_map_update_elem(), htab_map_update_elem() only requires rcu-read-lock to be held to avoid the WARN_ON_ONCE(), so only use rcu_read_lock() during the invocation of htab_map_update_elem(). Acked-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231214043010.3458072-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: Update the comments in maybe_wait_bpf_programs()Hou Tao2023-12-141-3/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since commit 638e4b825d52 ("bpf: Allows per-cpu maps and map-in-map in sleepable programs"), sleepable BPF program can also use map-in-map, but maybe_wait_bpf_programs() doesn't handle it accordingly. The main reason is that using synchronize_rcu_tasks_trace() to wait for the completions of these sleepable BPF programs may incur a very long delay and userspace may think it is hung, so the wait for sleepable BPF programs is skipped. Update the comments in maybe_wait_bpf_programs() to reflect the reason. Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/20231211083447.1921178-1-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: add small subset of SECURITY_PATH hooks to BPF sleepable_lsm_hooks listMatt Bobrowski2023-12-141-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | security_path_* based LSM hooks appear to be generally missing from the sleepable_lsm_hooks list. Initially add a small subset of them to the preexisting sleepable_lsm_hooks list so that sleepable BPF helpers like bpf_d_path() can be used from sleepable BPF LSM based programs. The security_path_* hooks added in this patch are similar to the security_inode_* counterparts that already exist in the sleepable_lsm_hooks list, and are called in roughly similar points and contexts. Presumably, making them OK to be also annotated as sleepable. Building a kernel with DEBUG_ATOMIC_SLEEP options enabled and running reasonable workloads stimulating activity that would be intercepted by such security hooks didn't show any splats. Notably, I haven't added all the security_path_* LSM hooks that are available as I don't need them at this point in time. Signed-off-by: Matt Bobrowski <mattbobrowski@google.com> Acked-by: KP Singh <kpsingh@kernel.org> Link: https://lore.kernel.org/r/ZXM3IHHXpNY9y82a@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: fail BPF_TOKEN_CREATE if no delegation option was set on BPF FSAndrii Nakryiko2023-12-141-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It's quite confusing in practice when it's possible to successfully create a BPF token from BPF FS that didn't have any of delegate_xxx mount options set up. While it's not wrong, it's actually more meaningful to reject BPF_TOKEN_CREATE with specific error code (-ENOENT) to let user-space know that no token delegation is setup up. So, instead of creating empty BPF token that will be always ignored because it doesn't have any of the allow_xxx bits set, reject it with -ENOENT. If we ever need empty BPF token to be possible, we can support that with extra flag passed into BPF_TOKEN_CREATE. Acked-by: Christian Brauner <brauner@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231213190842.3844987-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: Support uid and gid when mounting bpffsJie Jiang2023-12-141-1/+49
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Parse uid and gid in bpf_parse_param() so that they can be passed in as the `data` parameter when mount() bpffs. This will be useful when we want to control which user/group has the control to the mounted bpffs, otherwise a separate chown() call will be needed. Signed-off-by: Jie Jiang <jiejiang@chromium.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Mike Frysinger <vapier@chromium.org> Acked-by: Christian Brauner <brauner@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20231212093923.497838-1-jiejiang@chromium.org
| * bpf: Comment on check_mem_size_regAndrei Matei2023-12-131-0/+6
| | | | | | | | | | | | | | | | | | | | This patch adds a comment to check_mem_size_reg -- a function whose meaning is not very transparent. The function implicitly deals with two registers connected by convention, which is not obvious. Signed-off-by: Andrei Matei <andreimatei1@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20231210225149.67639-1-andreimatei1@gmail.com
| * bpf: Remove unused backtrack_state helper functionsYang Li2023-12-121-15/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The function are defined in the verifier.c file, but not called elsewhere, so delete the unused function. kernel/bpf/verifier.c:3448:20: warning: unused function 'bt_set_slot' kernel/bpf/verifier.c:3453:20: warning: unused function 'bt_clear_slot' kernel/bpf/verifier.c:3488:20: warning: unused function 'bt_is_slot_set' Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20231212005436.103829-1-yang.lee@linux.alibaba.com Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=7714
| * bpf: tidy up exception callback management a bitAndrii Nakryiko2023-12-122-22/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use the fact that we are passing subprog index around and have a corresponding struct bpf_subprog_info in bpf_verifier_env for each subprogram. We don't need to separately pass around a flag whether subprog is exception callback or not, each relevant verifier function can determine this using provided subprog index if we maintain bpf_subprog_info properly. Also move out exception callback-specific logic from btf_prepare_func_args(), keeping it generic. We can enforce all these restriction right before exception callback verification pass. We add out parameter, arg_cnt, for now, but this will be unnecessary with subsequent refactoring and will be removed. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20231204233931.49758-4-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: emit more dynptr information in verifier logAndrii Nakryiko2023-12-121-9/+16
| | | | | | | | | | | | | | | | | | | | Emit dynptr type for CONST_PTR_TO_DYNPTR register. Also emit id, ref_obj_id, and dynptr_id fields for STACK_DYNPTR stack slots. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20231204233931.49758-3-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: log PTR_TO_MEM memory size in verifier logAndrii Nakryiko2023-12-121-0/+4
| | | | | | | | | | | | | | | | | | Emit valid memory size addressable through PTR_TO_MEM register. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20231204233931.49758-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: Add bpf_cpumask_weight() kfuncDavid Vernet2023-12-101-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | It can be useful to query how many bits are set in a cpumask. For example, if you want to perform special logic for the last remaining core that's set in a mask. Let's therefore add a new bpf_cpumask_weight() kfunc which checks how many bits are set in a mask. Signed-off-by: David Vernet <void@manifault.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20231207210843.168466-2-void@manifault.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: handle fake register spill to stack with BPF_ST_MEM instructionAndrii Nakryiko2023-12-101-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When verifier validates BPF_ST_MEM instruction that stores known constant to stack (e.g., *(u64 *)(r10 - 8) = 123), it effectively spills a fake register with a constant (but initially imprecise) value to a stack slot. Because read-side logic treats it as a proper register fill from stack slot, we need to mark such stack slot initialization as INSN_F_STACK_ACCESS instruction to stop precision backtracking from missing it. Fixes: 41f6f64e6999 ("bpf: support non-r10 register spill/fill to/from stack in precision tracking") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20231209010958.66758-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: Set uattr->batch.count as zero before batched update or deletionHou Tao2023-12-101-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | generic_map_{delete,update}_batch() doesn't set uattr->batch.count as zero before it tries to allocate memory for key. If the memory allocation fails, the value of uattr->batch.count will be incorrect. Fix it by setting uattr->batch.count as zero beore batched update or deletion. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231208102355.2628918-6-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: Only call maybe_wait_bpf_programs() when map operation succeedsHou Tao2023-12-101-2/+4
| | | | | | | | | | | | | | | | | | | | There is no need to call maybe_wait_bpf_programs() if update or deletion operation fails. So only call maybe_wait_bpf_programs() if update or deletion operation succeeds. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231208102355.2628918-5-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: Add missed maybe_wait_bpf_programs() for htab of mapsHou Tao2023-12-101-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | When doing batched lookup and deletion operations on htab of maps, maybe_wait_bpf_programs() is needed to ensure all programs don't use the inner map after the bpf syscall returns. Instead of adding the wait in __htab_map_lookup_and_delete_batch(), adding the wait in bpf_map_do_batch() and also removing the calling of maybe_wait_bpf_programs() from generic_map_{delete,update}_batch(). Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231208102355.2628918-4-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: Call maybe_wait_bpf_programs() only once for generic_map_update_batch()Hou Tao2023-12-101-1/+3
| | | | | | | | | | | | | | | | | | | | | | Just like commit 9087c6ff8dfe ("bpf: Call maybe_wait_bpf_programs() only once from generic_map_delete_batch()"), there is also no need to call maybe_wait_bpf_programs() for each update in batched update, so only call it once in generic_map_update_batch(). Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231208102355.2628918-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: Remove unnecessary wait from bpf_map_copy_value()Hou Tao2023-12-101-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | Both map_lookup_elem() and generic_map_lookup_batch() use bpf_map_copy_value() to lookup and copy the value, and there is no update operation in bpf_map_copy_value(), so just remove the invocation of maybe_wait_bpf_programs() from it. Fixes: 15c14a3dca42 ("bpf: Add bpf_map_{value_size, update_value, map_copy_value} functions") Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231208102355.2628918-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: Enable bpf_cgrp_storage for cgroup1 non-attach caseYafang Shao2023-12-091-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the current cgroup1 environment, associating operations between cgroups and applications in a BPF program requires storing a mapping of cgroup_id to application either in a hash map or maintaining it in userspace. However, by enabling bpf_cgrp_storage for cgroup1, it becomes possible to conveniently store application-specific information in cgroup-local storage and utilize it within BPF programs. Furthermore, enabling this feature for cgroup1 involves minor modifications for the non-attach case, streamlining the process. However, when it comes to enabling this functionality for the cgroup1 attach case, it presents challenges. Therefore, the decision is to focus on enabling it solely for the cgroup1 non-attach case at present. If attempting to attach to a cgroup1 fd, the operation will simply fail with the error code -EBADF. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20231206115326.4295-2-laoar.shao@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
| * bpf: Minor cleanup around stack boundsAndrei Matei2023-12-081-2/+9
| | | | | | | | | | | | | | | | | | | | | | | | Push the rounding up of stack offsets into the function responsible for growing the stack, rather than relying on all the callers to do it. Uncertainty about whether the callers did it or not tripped up people in a previous review. Signed-off-by: Andrei Matei <andreimatei1@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20231208032519.260451-4-andreimatei1@gmail.com
| * bpf: Fix accesses to uninit stack slotsAndrei Matei2023-12-081-39/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Privileged programs are supposed to be able to read uninitialized stack memory (ever since 6715df8d5) but, before this patch, these accesses were permitted inconsistently. In particular, accesses were permitted above state->allocated_stack, but not below it. In other words, if the stack was already "large enough", the access was permitted, but otherwise the access was rejected instead of being allowed to "grow the stack". This undesired rejection was happening in two places: - in check_stack_slot_within_bounds() - in check_stack_range_initialized() This patch arranges for these accesses to be permitted. A bunch of tests that were relying on the old rejection had to change; all of them were changed to add also run unprivileged, in which case the old behavior persists. One tests couldn't be updated - global_func16 - because it can't run unprivileged for other reasons. This patch also fixes the tracking of the stack size for variable-offset reads. This second fix is bundled in the same commit as the first one because they're inter-related. Before this patch, writes to the stack using registers containing a variable offset (as opposed to registers with fixed, known values) were not properly contributing to the function's needed stack size. As a result, it was possible for a program to verify, but then to attempt to read out-of-bounds data at runtime because a too small stack had been allocated for it. Each function tracks the size of the stack it needs in bpf_subprog_info.stack_depth, which is maintained by update_stack_depth(). For regular memory accesses, check_mem_access() was calling update_state_depth() but it was passing in only the fixed part of the offset register, ignoring the variable offset. This was incorrect; the minimum possible value of that register should be used instead. This tracking is now fixed by centralizing the tracking of stack size in grow_stack_state(), and by lifting the calls to grow_stack_state() to check_stack_access_within_bounds() as suggested by Andrii. The code is now simpler and more convincingly tracks the correct maximum stack size. check_stack_range_initialized() can now rely on enough stack having been allocated for the access; this helps with the fix for the first issue. A few tests were changed to also check the stack depth computation. The one that fails without this patch is verifier_var_off:stack_write_priv_vs_unpriv. Fixes: 01f810ace9ed3 ("bpf: Allow variable-offset stack access") Reported-by: Hao Sun <sunhao.th@gmail.com> Signed-off-by: Andrei Matei <andreimatei1@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20231208032519.260451-3-andreimatei1@gmail.com Closes: https://lore.kernel.org/bpf/CABWLsev9g8UP_c3a=1qbuZUi20tGoUXoU07FPf-5FLvhOKOY+Q@mail.gmail.com/
| * bpf: Guard stack limits against 32bit overflowAndrei Matei2023-12-071-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch promotes the arithmetic around checking stack bounds to be done in the 64-bit domain, instead of the current 32bit. The arithmetic implies adding together a 64-bit register with a int offset. The register was checked to be below 1<<29 when it was variable, but not when it was fixed. The offset either comes from an instruction (in which case it is 16 bit), from another register (in which case the caller checked it to be below 1<<29 [1]), or from the size of an argument to a kfunc (in which case it can be a u32 [2]). Between the register being inconsistently checked to be below 1<<29, and the offset being up to an u32, it appears that we were open to overflowing the `int`s which were currently used for arithmetic. [1] https://github.com/torvalds/linux/blob/815fb87b753055df2d9e50f6cd80eb10235fe3e9/kernel/bpf/verifier.c#L7494-L7498 [2] https://github.com/torvalds/linux/blob/815fb87b753055df2d9e50f6cd80eb10235fe3e9/kernel/bpf/verifier.c#L11904 Reported-by: Andrii Nakryiko <andrii.nakryiko@gmail.com> Signed-off-by: Andrei Matei <andreimatei1@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20231207041150.229139-4-andreimatei1@gmail.com
| * bpf: Fix verification of indirect var-off stack accessAndrei Matei2023-12-071-10/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch fixes a bug around the verification of possibly-zero-sized stack accesses. When the access was done through a var-offset stack pointer, check_stack_access_within_bounds was incorrectly computing the maximum-offset of a zero-sized read to be the same as the register's min offset. Instead, we have to take in account the register's maximum possible value. The patch also simplifies how the max offset is checked; the check is now simpler than for min offset. The bug was allowing accesses to erroneously pass the check_stack_access_within_bounds() checks, only to later crash in check_stack_range_initialized() when all the possibly-affected stack slots are iterated (this time with a correct max offset). check_stack_range_initialized() is relying on check_stack_access_within_bounds() for its accesses to the stack-tracking vector to be within bounds; in the case of zero-sized accesses, we were essentially only verifying that the lowest possible slot was within bounds. We would crash when the max-offset of the stack pointer was >= 0 (which shouldn't pass verification, and hopefully is not something anyone's code attempts to do in practice). Thanks Hao for reporting! Fixes: 01f810ace9ed3 ("bpf: Allow variable-offset stack access") Reported-by: Hao Sun <sunhao.th@gmail.com> Signed-off-by: Andrei Matei <andreimatei1@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20231207041150.229139-2-andreimatei1@gmail.com Closes: https://lore.kernel.org/bpf/CACkBjsZGEUaRCHsmaX=h-efVogsRfK1FPxmkgb0Os_frnHiNdw@mail.gmail.com/
| * bpf: Use arch_bpf_trampoline_sizeSong Liu2023-12-073-22/+42
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of blindly allocating PAGE_SIZE for each trampoline, check the size of the trampoline with arch_bpf_trampoline_size(). This size is saved in bpf_tramp_image->size, and used for modmem charge/uncharge. The fallback arch_alloc_bpf_trampoline() still allocates a whole page because we need to use set_memory_* to protect the memory. struct_ops trampoline still uses a whole page for multiple trampolines. With this size check at caller (regular trampoline and struct_ops trampoline), remove arch_bpf_trampoline_size() from arch_prepare_bpf_trampoline() in archs. Also, update bpf_image_ksym_add() to handle symbol of different sizes. Signed-off-by: Song Liu <song@kernel.org> Acked-by: Ilya Leoshkevich <iii@linux.ibm.com> Tested-by: Ilya Leoshkevich <iii@linux.ibm.com> # on s390x Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Björn Töpel <bjorn@rivosinc.com> Tested-by: Björn Töpel <bjorn@rivosinc.com> # on riscv Link: https://lore.kernel.org/r/20231206224054.492250-7-song@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: Add arch_bpf_trampoline_size()Song Liu2023-12-071-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This helper will be used to calculate the size of the trampoline before allocating the memory. arch_prepare_bpf_trampoline() for arm64 and riscv64 can use arch_bpf_trampoline_size() to check the trampoline fits in the image. OTOH, arch_prepare_bpf_trampoline() for s390 has to call the JIT process twice, so it cannot use arch_bpf_trampoline_size(). Signed-off-by: Song Liu <song@kernel.org> Acked-by: Ilya Leoshkevich <iii@linux.ibm.com> Tested-by: Ilya Leoshkevich <iii@linux.ibm.com> # on s390x Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Björn Töpel <bjorn@rivosinc.com> Tested-by: Björn Töpel <bjorn@rivosinc.com> # on riscv Link: https://lore.kernel.org/r/20231206224054.492250-6-song@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: Add helpers for trampoline image managementSong Liu2023-12-072-14/+44
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As BPF trampoline of different archs moves from bpf_jit_[alloc|free]_exec() to bpf_prog_pack_[alloc|free](), we need to use different _alloc, _free for different archs during the transition. Add the following helpers for this transition: void *arch_alloc_bpf_trampoline(unsigned int size); void arch_free_bpf_trampoline(void *image, unsigned int size); void arch_protect_bpf_trampoline(void *image, unsigned int size); void arch_unprotect_bpf_trampoline(void *image, unsigned int size); The fallback version of these helpers require size <= PAGE_SIZE, but they are only called with size == PAGE_SIZE. They will be called with size < PAGE_SIZE when arch_bpf_trampoline_size() helper is introduced later. Signed-off-by: Song Liu <song@kernel.org> Acked-by: Ilya Leoshkevich <iii@linux.ibm.com> Tested-by: Ilya Leoshkevich <iii@linux.ibm.com> # on s390x Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20231206224054.492250-4-song@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: Adjust argument names of arch_prepare_bpf_trampoline()Song Liu2023-12-071-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We are using "im" for "struct bpf_tramp_image" and "tr" for "struct bpf_trampoline" in most of the code base. The only exception is the prototype and fallback version of arch_prepare_bpf_trampoline(). Update them to match the rest of the code base. We mix "orig_call" and "func_addr" for the argument in different versions of arch_prepare_bpf_trampoline(). s/orig_call/func_addr/g so they match. Signed-off-by: Song Liu <song@kernel.org> Acked-by: Ilya Leoshkevich <iii@linux.ibm.com> Tested-by: Ilya Leoshkevich <iii@linux.ibm.com> # on s390x Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20231206224054.492250-3-song@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: Let bpf_prog_pack_free handle any pointerSong Liu2023-12-072-15/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, bpf_prog_pack_free only can only free pointer to struct bpf_binary_header, which is not flexible. Add a size argument to bpf_prog_pack_free so that it can handle any pointer. Signed-off-by: Song Liu <song@kernel.org> Acked-by: Ilya Leoshkevich <iii@linux.ibm.com> Tested-by: Ilya Leoshkevich <iii@linux.ibm.com> # on s390x Reviewed-by: Björn Töpel <bjorn@rivosinc.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20231206224054.492250-2-song@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf,lsm: add BPF token LSM hooksAndrii Nakryiko2023-12-062-6/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Wire up bpf_token_create and bpf_token_free LSM hooks, which allow to allocate LSM security blob (we add `void *security` field to struct bpf_token for that), but also control who can instantiate BPF token. This follows existing pattern for BPF map and BPF prog. Also add security_bpf_token_allow_cmd() and security_bpf_token_capable() LSM hooks that allow LSM implementation to control and negate (if necessary) BPF token's delegation of a specific bpf_cmd and capability, respectively. Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231130185229.2688956-12-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf,lsm: refactor bpf_map_alloc/bpf_map_free LSM hooksAndrii Nakryiko2023-12-062-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Similarly to bpf_prog_alloc LSM hook, rename and extend bpf_map_alloc hook into bpf_map_create, taking not just struct bpf_map, but also bpf_attr and bpf_token, to give a fuller context to LSMs. Unlike bpf_prog_alloc, there is no need to move the hook around, as it currently is firing right before allocating BPF map ID and FD, which seems to be a sweet spot. But like bpf_prog_alloc/bpf_prog_free combo, make sure that bpf_map_free LSM hook is called even if bpf_map_create hook returned error, as if few LSMs are combined together it could be that one LSM successfully allocated security blob for its needs, while subsequent LSM rejected BPF map creation. The former LSM would still need to free up LSM blob, so we need to ensure security_bpf_map_free() is called regardless of the outcome. Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231130185229.2688956-11-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf,lsm: refactor bpf_prog_alloc/bpf_prog_free LSM hooksAndrii Nakryiko2023-12-062-14/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Based on upstream discussion ([0]), rework existing bpf_prog_alloc_security LSM hook. Rename it to bpf_prog_load and instead of passing bpf_prog_aux, pass proper bpf_prog pointer for a full BPF program struct. Also, we pass bpf_attr union with all the user-provided arguments for BPF_PROG_LOAD command. This will give LSMs as much information as we can basically provide. The hook is also BPF token-aware now, and optional bpf_token struct is passed as a third argument. bpf_prog_load LSM hook is called after a bunch of sanity checks were performed, bpf_prog and bpf_prog_aux were allocated and filled out, but right before performing full-fledged BPF verification step. bpf_prog_free LSM hook is now accepting struct bpf_prog argument, for consistency. SELinux code is adjusted to all new names, types, and signatures. Note, given that bpf_prog_load (previously bpf_prog_alloc) hook can be used by some LSMs to allocate extra security blob, but also by other LSMs to reject BPF program loading, we need to make sure that bpf_prog_free LSM hook is called after bpf_prog_load/bpf_prog_alloc one *even* if the hook itself returned error. If we don't do that, we run the risk of leaking memory. This seems to be possible today when combining SELinux and BPF LSM, as one example, depending on their relative ordering. Also, for BPF LSM setup, add bpf_prog_load and bpf_prog_free to sleepable LSM hooks list, as they are both executed in sleepable context. Also drop bpf_prog_load hook from untrusted, as there is no issue with refcount or anything else anymore, that originally forced us to add it to untrusted list in c0c852dd1876 ("bpf: Do not mark certain LSM hook arguments as trusted"). We now trigger this hook much later and it should not be an issue anymore. [0] https://lore.kernel.org/bpf/9fe88aef7deabbe87d3fc38c4aea3c69.paul@paul-moore.com/ Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231130185229.2688956-10-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: consistently use BPF token throughout BPF verifier logicAndrii Nakryiko2023-12-063-9/+8
| | | | | | | | | | | | | | | | | | | | Remove remaining direct queries to perfmon_capable() and bpf_capable() in BPF verifier logic and instead use BPF token (if available) to make decisions about privileges. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231130185229.2688956-9-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: take into account BPF token when fetching helper protosAndrii Nakryiko2023-12-064-9/+10
| | | | | | | | | | | | | | | | | | | | | | | | Instead of performing unconditional system-wide bpf_capable() and perfmon_capable() calls inside bpf_base_func_proto() function (and other similar ones) to determine eligibility of a given BPF helper for a given program, use previously recorded BPF token during BPF_PROG_LOAD command handling to inform the decision. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231130185229.2688956-8-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: add BPF token support to BPF_PROG_LOAD commandAndrii Nakryiko2023-12-064-26/+95
| | | | | | | | | | | | | | | | | | | | | | Add basic support of BPF token to BPF_PROG_LOAD. Wire through a set of allowed BPF program types and attach types, derived from BPF FS at BPF token creation time. Then make sure we perform bpf_token_capable() checks everywhere where it's relevant. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231130185229.2688956-7-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: add BPF token support to BPF_BTF_LOAD commandAndrii Nakryiko2023-12-061-2/+18
| | | | | | | | | | | | | | | | | | | | | | | | Accept BPF token FD in BPF_BTF_LOAD command to allow BTF data loading through delegated BPF token. BTF loading is a pretty straightforward operation, so as long as BPF token is created with allow_cmds granting BPF_BTF_LOAD command, kernel proceeds to parsing BTF data and creating BTF object. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231130185229.2688956-6-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: add BPF token support to BPF_MAP_CREATE commandAndrii Nakryiko2023-12-063-15/+56
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allow providing token_fd for BPF_MAP_CREATE command to allow controlled BPF map creation from unprivileged process through delegated BPF token. Wire through a set of allowed BPF map types to BPF token, derived from BPF FS at BPF token creation time. This, in combination with allowed_cmds allows to create a narrowly-focused BPF token (controlled by privileged agent) with a restrictive set of BPF maps that application can attempt to create. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231130185229.2688956-5-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: introduce BPF token objectAndrii Nakryiko2023-12-064-6/+239
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add new kind of BPF kernel object, BPF token. BPF token is meant to allow delegating privileged BPF functionality, like loading a BPF program or creating a BPF map, from privileged process to a *trusted* unprivileged process, all while having a good amount of control over which privileged operations could be performed using provided BPF token. This is achieved through mounting BPF FS instance with extra delegation mount options, which determine what operations are delegatable, and also constraining it to the owning user namespace (as mentioned in the previous patch). BPF token itself is just a derivative from BPF FS and can be created through a new bpf() syscall command, BPF_TOKEN_CREATE, which accepts BPF FS FD, which can be attained through open() API by opening BPF FS mount point. Currently, BPF token "inherits" delegated command, map types, prog type, and attach type bit sets from BPF FS as is. In the future, having an BPF token as a separate object with its own FD, we can allow to further restrict BPF token's allowable set of things either at the creation time or after the fact, allowing the process to guard itself further from unintentionally trying to load undesired kind of BPF programs. But for now we keep things simple and just copy bit sets as is. When BPF token is created from BPF FS mount, we take reference to the BPF super block's owning user namespace, and then use that namespace for checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN} capabilities that are normally only checked against init userns (using capable()), but now we check them using ns_capable() instead (if BPF token is provided). See bpf_token_capable() for details. Such setup means that BPF token in itself is not sufficient to grant BPF functionality. User namespaced process has to *also* have necessary combination of capabilities inside that user namespace. So while previously CAP_BPF was useless when granted within user namespace, now it gains a meaning and allows container managers and sys admins to have a flexible control over which processes can and need to use BPF functionality within the user namespace (i.e., container in practice). And BPF FS delegation mount options and derived BPF tokens serve as a per-container "flag" to grant overall ability to use bpf() (plus further restrict on which parts of bpf() syscalls are treated as namespaced). Note also, BPF_TOKEN_CREATE command itself requires ns_capable(CAP_BPF) within the BPF FS owning user namespace, rounding up the ns_capable() story of BPF token. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231130185229.2688956-4-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>