diff options
author | Daniel Borkmann <daniel@iogearbox.net> | 2018-03-31 02:17:57 +0200 |
---|---|---|
committer | Daniel Borkmann <daniel@iogearbox.net> | 2018-03-31 02:18:07 +0200 |
commit | 7828f20e3779e4e85e55371e0e43f5006a15fb41 (patch) | |
tree | 48f4977b0b8e69bd6432b18556ad9ac7ca7728eb /kernel/bpf/syscall.c | |
parent | Merge branch 'bpf-sockmap-sg-api-fixes' (diff) | |
parent | selftests/bpf: Selftest for sys_bind post-hooks. (diff) | |
download | linux-7828f20e3779e4e85e55371e0e43f5006a15fb41.tar.xz linux-7828f20e3779e4e85e55371e0e43f5006a15fb41.zip |
Merge branch 'bpf-cgroup-bind-connect'
Andrey Ignatov says:
====================
v2->v3:
- rebase due to conflicts
- fix ipv6=m build
v1->v2:
- support expected_attach_type at prog load time so that prog (incl.
context accesses and calls to helpers) can be validated with regard to
specific attach point it is supposed to be attached to.
Later, at attach time, attach type is checked so that it must be same as
at load time if it was provided
- reworked hooks to rely on expected_attach_type, and reduced number of new
prog types from 6 to just 1: BPF_PROG_TYPE_CGROUP_SOCK_ADDR
- reused BPF_PROG_TYPE_CGROUP_SOCK for sys_bind post-hooks
- add selftests for post-sys_bind hook
For our container management we've been using complicated and fragile setup
consisting of LD_PRELOAD wrapper intercepting bind and connect calls from
all containerized applications. Unfortunately it doesn't work for apps that
don't use glibc and changing all applications that run in the datacenter
is not possible due to 3rd party code and libraries (despite being
open source code) and sheer amount of legacy code that has to be rewritten
(we're rewriting what we can in parallel)
These applications are written without containers in mind and have
builtin assumptions about network services. Like an application X
expects to connect localhost:special_port and find service Y in there.
To move application X and service Y into two different containers
LD_PRELOAD approach is used to help one service connect to another
without rewriting them.
Moving these two applications into different L2 (netns) or L3 (vrf)
network isolation scopes doesn't help to solve the problem, since
applications need to see each other like they were running on
the host without containers.
So if app X and app Y would run in different netns something
would need to punch a connectivity hole in those namespaces.
That would be real layering violation (with corresponding
network debugging pains), since clean l2, l3 abstraction would
suddenly support something that breaks through the layers.
Instead we used LD_PRELOAD (and now bpf programs) at bind/connect
time to help applications discover and connect to each other.
All applications are running in init_nens and there are no vrfs.
After bind/connect the normal fib/neighbor core networking
logic works as it should always do and the whole system is
clean from network point of view and can be debugged with
standard tools.
We also considered resurrecting Hannes's afnetns work,
but all hierarchical namespace abstraction don't work due
to these builtin networking assumptions inside the apps.
To run an application inside cgroup container that was not written
with containers in mind we have to make an illusion of running
in non-containerized environment.
In some cases we remember the port and container id in the post-bind hook
in a bpf map and when some other task in a different container is trying
to connect to a service we need to know where this service is running.
It can be remote and can be local. Both client and service may or may not
be written with containers in mind and this sockaddr rewrite is providing
connectivity and load balancing feature.
BPF+cgroup looks to be the best solution for this problem.
Hence we introduce 3 hooks:
- at entry into sys_bind and sys_connect
to let bpf prog look and modify 'struct sockaddr' provided
by user space and fail bind/connect when appropriate
- post sys_bind after port is allocated
The approach works great and has zero overhead for anyone who doesn't
use it and very low overhead when deployed.
Different use case for this feature is to do low overhead firewall
that doesn't need to inspect all packets and works at bind/connect time.
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Diffstat (limited to 'kernel/bpf/syscall.c')
-rw-r--r-- | kernel/bpf/syscall.c | 102 |
1 files changed, 101 insertions, 1 deletions
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 95ca2523fa6e..0244973ee544 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -1171,8 +1171,75 @@ struct bpf_prog *bpf_prog_get_type_dev(u32 ufd, enum bpf_prog_type type, } EXPORT_SYMBOL_GPL(bpf_prog_get_type_dev); +/* Initially all BPF programs could be loaded w/o specifying + * expected_attach_type. Later for some of them specifying expected_attach_type + * at load time became required so that program could be validated properly. + * Programs of types that are allowed to be loaded both w/ and w/o (for + * backward compatibility) expected_attach_type, should have the default attach + * type assigned to expected_attach_type for the latter case, so that it can be + * validated later at attach time. + * + * bpf_prog_load_fixup_attach_type() sets expected_attach_type in @attr if + * prog type requires it but has some attach types that have to be backward + * compatible. + */ +static void bpf_prog_load_fixup_attach_type(union bpf_attr *attr) +{ + switch (attr->prog_type) { + case BPF_PROG_TYPE_CGROUP_SOCK: + /* Unfortunately BPF_ATTACH_TYPE_UNSPEC enumeration doesn't + * exist so checking for non-zero is the way to go here. + */ + if (!attr->expected_attach_type) + attr->expected_attach_type = + BPF_CGROUP_INET_SOCK_CREATE; + break; + } +} + +static int +bpf_prog_load_check_attach_type(enum bpf_prog_type prog_type, + enum bpf_attach_type expected_attach_type) +{ + switch (prog_type) { + case BPF_PROG_TYPE_CGROUP_SOCK: + switch (expected_attach_type) { + case BPF_CGROUP_INET_SOCK_CREATE: + case BPF_CGROUP_INET4_POST_BIND: + case BPF_CGROUP_INET6_POST_BIND: + return 0; + default: + return -EINVAL; + } + case BPF_PROG_TYPE_CGROUP_SOCK_ADDR: + switch (expected_attach_type) { + case BPF_CGROUP_INET4_BIND: + case BPF_CGROUP_INET6_BIND: + case BPF_CGROUP_INET4_CONNECT: + case BPF_CGROUP_INET6_CONNECT: + return 0; + default: + return -EINVAL; + } + default: + return 0; + } +} + +static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog, + enum bpf_attach_type attach_type) +{ + switch (prog->type) { + case BPF_PROG_TYPE_CGROUP_SOCK: + case BPF_PROG_TYPE_CGROUP_SOCK_ADDR: + return attach_type == prog->expected_attach_type ? 0 : -EINVAL; + default: + return 0; + } +} + /* last field in 'union bpf_attr' used by this command */ -#define BPF_PROG_LOAD_LAST_FIELD prog_ifindex +#define BPF_PROG_LOAD_LAST_FIELD expected_attach_type static int bpf_prog_load(union bpf_attr *attr) { @@ -1209,11 +1276,17 @@ static int bpf_prog_load(union bpf_attr *attr) !capable(CAP_SYS_ADMIN)) return -EPERM; + bpf_prog_load_fixup_attach_type(attr); + if (bpf_prog_load_check_attach_type(type, attr->expected_attach_type)) + return -EINVAL; + /* plain bpf_prog allocation */ prog = bpf_prog_alloc(bpf_prog_size(attr->insn_cnt), GFP_USER); if (!prog) return -ENOMEM; + prog->expected_attach_type = attr->expected_attach_type; + prog->aux->offload_requested = !!attr->prog_ifindex; err = security_bpf_prog_alloc(prog->aux); @@ -1453,8 +1526,16 @@ static int bpf_prog_attach(const union bpf_attr *attr) ptype = BPF_PROG_TYPE_CGROUP_SKB; break; case BPF_CGROUP_INET_SOCK_CREATE: + case BPF_CGROUP_INET4_POST_BIND: + case BPF_CGROUP_INET6_POST_BIND: ptype = BPF_PROG_TYPE_CGROUP_SOCK; break; + case BPF_CGROUP_INET4_BIND: + case BPF_CGROUP_INET6_BIND: + case BPF_CGROUP_INET4_CONNECT: + case BPF_CGROUP_INET6_CONNECT: + ptype = BPF_PROG_TYPE_CGROUP_SOCK_ADDR; + break; case BPF_CGROUP_SOCK_OPS: ptype = BPF_PROG_TYPE_SOCK_OPS; break; @@ -1474,6 +1555,11 @@ static int bpf_prog_attach(const union bpf_attr *attr) if (IS_ERR(prog)) return PTR_ERR(prog); + if (bpf_prog_attach_check_attach_type(prog, attr->attach_type)) { + bpf_prog_put(prog); + return -EINVAL; + } + cgrp = cgroup_get_from_fd(attr->target_fd); if (IS_ERR(cgrp)) { bpf_prog_put(prog); @@ -1510,8 +1596,16 @@ static int bpf_prog_detach(const union bpf_attr *attr) ptype = BPF_PROG_TYPE_CGROUP_SKB; break; case BPF_CGROUP_INET_SOCK_CREATE: + case BPF_CGROUP_INET4_POST_BIND: + case BPF_CGROUP_INET6_POST_BIND: ptype = BPF_PROG_TYPE_CGROUP_SOCK; break; + case BPF_CGROUP_INET4_BIND: + case BPF_CGROUP_INET6_BIND: + case BPF_CGROUP_INET4_CONNECT: + case BPF_CGROUP_INET6_CONNECT: + ptype = BPF_PROG_TYPE_CGROUP_SOCK_ADDR; + break; case BPF_CGROUP_SOCK_OPS: ptype = BPF_PROG_TYPE_SOCK_OPS; break; @@ -1561,6 +1655,12 @@ static int bpf_prog_query(const union bpf_attr *attr, case BPF_CGROUP_INET_INGRESS: case BPF_CGROUP_INET_EGRESS: case BPF_CGROUP_INET_SOCK_CREATE: + case BPF_CGROUP_INET4_BIND: + case BPF_CGROUP_INET6_BIND: + case BPF_CGROUP_INET4_POST_BIND: + case BPF_CGROUP_INET6_POST_BIND: + case BPF_CGROUP_INET4_CONNECT: + case BPF_CGROUP_INET6_CONNECT: case BPF_CGROUP_SOCK_OPS: case BPF_CGROUP_DEVICE: break; |