summaryrefslogtreecommitdiffstats
path: root/include (follow)
Commit message (Collapse)AuthorAgeFilesLines
* net: devlink: select NET_DEVLINK from driversJiri Pirko2019-03-241-492/+3
| | | | | | | | | | Some drivers are becoming more dependent on NET_DEVLINK being selected in configuration. With upcoming compat functions, the behavior would be wrong in case devlink was not compiled in. So make the drivers select NET_DEVLINK and rely on the functions being there, not just stubs. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: devlink: add port type spinlockJiri Pirko2019-03-241-0/+4
| | | | | | | | | | | Add spinlock to protect port type and type_dev pointer consistency. Without that, userspace may see inconsistent type and type_dev combinations. Signed-off-by: Jiri Pirko <jiri@mellanox.com> v1->v2: - rebased Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge tag 'mlx5-updates-2019-03-20' of ↵David S. Miller2019-03-243-0/+5
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5-updates-2019-03-20 This series includes updates to mlx5 driver, 1) Compiler warnings cleanup from Saeed Mahameed 2) Parav Pandit simplifies sriov enable/disables 3) Gustavo A. R. Silva, Removes a redundant assignment 4) Moshe Shemesh, Adds Geneve tunnel stateless offload support 5) Eli Britstein, Adds the Support for VLAN modify action and Replaces TC VLAN pop and push actions with VLAN modify Note: This series includes two simple non-mlx5 patches, 1) Declare IANA_VXLAN_UDP_PORT definition in include/net/vxlan.h, and use it in some drivers. 2) Declare GENEVE_UDP_PORT definition in include/net/geneve.h, and use it in mlx5 and nfp drivers. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/mlx5e: Add VLAN ID rewrite fieldsEli Britstein2019-03-221-0/+1
| | | | | | | | | | | | | | | | Add VLAN ID rewrite fields as a pre-step to support this rewrite. Signed-off-by: Eli Britstein <elibr@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
| * net: Add IANA_VXLAN_UDP_PORT definition to vxlan header fileMoshe Shemesh2019-03-221-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Added IANA_VXLAN_UDP_PORT (4789) definition to vxlan header file so it can be used by drivers instead of local definition. Updated drivers which locally defined it as 4789 to use it. Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Cc: John Hurley <john.hurley@netronome.com> Cc: Jakub Kicinski <jakub.kicinski@netronome.com> Cc: Yunsheng Lin <linyunsheng@huawei.com> Cc: Peng Li <lipeng321@huawei.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
| * net: Move the definition of the default Geneve udp port to public header fileMoshe Shemesh2019-03-221-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Move the definition of the default Geneve udp port from the geneve source to the header file, so we can re-use it from drivers. Modify existing drivers to use it. Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Cc: John Hurley <john.hurley@netronome.com> Cc: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
* | tcp: add one skb cache for rxEric Dumazet2019-03-241-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Often times, recvmsg() system calls and BH handling for a particular TCP socket are done on different cpus. This means the incoming skb had to be allocated on a cpu, but freed on another. This incurs a high spinlock contention in slab layer for small rpc, but also a high number of cache line ping pongs for larger packets. A full size GRO packet might use 45 page fragments, meaning that up to 45 put_page() can be involved. More over performing the __kfree_skb() in the recvmsg() context adds a latency for user applications, and increase probability of trapping them in backlog processing, since the BH handler might found the socket owned by the user. This patch, combined with the prior one increases the rpc performance by about 10 % on servers with large number of cores. (tcp_rr workload with 10,000 flows and 112 threads reach 9 Mpps instead of 8 Mpps) This also increases single bulk flow performance on 40Gbit+ links, since in this case there are often two cpus working in tandem : - CPU handling the NIC rx interrupts, feeding the receive queue, and (after this patch) freeing the skbs that were consumed. - CPU in recvmsg() system call, essentially 100 % busy copying out data to user space. Having at most one skb in a per-socket cache has very little risk of memory exhaustion, and since it is protected by socket lock, its management is essentially free. Note that if rps/rfs is used, we do not enable this feature, because there is high chance that the same cpu is handling both the recvmsg() system call and the TCP rx path, but that another cpu did the skb allocations in the device driver right before the RPS/RFS logic. To properly handle this case, it seems we would need to record on which cpu skb was allocated, and use a different channel to give skbs back to this cpu. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: add one skb cache for txEric Dumazet2019-03-241-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On hosts with a lot of cores, RPC workloads suffer from heavy contention on slab spinlocks. 20.69% [kernel] [k] queued_spin_lock_slowpath 5.64% [kernel] [k] _raw_spin_lock 3.83% [kernel] [k] syscall_return_via_sysret 3.48% [kernel] [k] __entry_text_start 1.76% [kernel] [k] __netif_receive_skb_core 1.64% [kernel] [k] __fget For each sendmsg(), we allocate one skb, and free it at the time ACK packet comes. In many cases, ACK packets are handled by another cpus, and this unfortunately incurs heavy costs for slab layer. This patch uses an extra pointer in socket structure, so that we try to reuse the same skb and avoid these expensive costs. We cache at most one skb per socket so this should be safe as far as memory pressure is concerned. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: convert rps_needed and rfs_needed to new static branch apiEric Dumazet2019-03-242-3/+3
| | | | | | | | | | | | | | | | | | We prefer static_branch_unlikely() over static_key_false() these days. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: sched: add empty status flag for NOLOCK qdiscPaolo Abeni2019-03-241-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The queue is marked not empty after acquiring the seqlock, and it's up to the NOLOCK qdisc clearing such flag on dequeue. Since the empty status lays on the same cache-line of the seqlock, it's always hot on cache during the updates. This makes the empty flag update a little bit loosy. Given the lack of synchronization between enqueue and dequeue, this is unavoidable. v2 -> v3: - qdisc_is_empty() has a const argument (Eric) v1 -> v2: - use really an 'empty' flag instead of 'not_empty', as suggested by Eric Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Ivan Vecera <ivecera@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: add documentation for tcp_ca_stateSoheil Hassas Yeganeh2019-03-241-0/+27
|/ | | | | | | | | | | | | Add documentation to the tcp_ca_state enum, since this enum is exposed in uapi. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Cc: Sowmini Varadhan <sowmini05@gmail.com> Acked-by: Sowmini Varadhan <sowmini05@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* genetlink: make policy common to familyJohannes Berg2019-03-222-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since maxattr is common, the policy can't really differ sanely, so make it common as well. The only user that did in fact manage to make a non-common policy is taskstats, which has to be really careful about it (since it's still using a common maxattr!). This is no longer supported, but we can fake it using pre_doit. This reduces the size of e.g. nl80211.o (which has lots of commands): text data bss dec hex filename 398745 14323 2240 415308 6564c net/wireless/nl80211.o (before) 397913 14331 2240 414484 65314 net/wireless/nl80211.o (after) -------------------------------- -832 +8 0 -824 Which is obviously just 8 bytes for each command, and an added 8 bytes for the new policy pointer. I'm not sure why the ops list is counted as .text though. Most of the code transformations were done using the following spatch: @ops@ identifier OPS; expression POLICY; @@ struct genl_ops OPS[] = { ..., { - .policy = POLICY, }, ... }; @@ identifier ops.OPS; expression ops.POLICY; identifier fam; expression M; @@ struct genl_family fam = { .ops = OPS, .maxattr = M, + .policy = POLICY, ... }; This also gets rid of devlink_nl_cmd_region_read_dumpit() accessing the cb->data as ops, which we want to change in a later genl patch. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* rhashtable: rename rht_for_each*continue as *from.NeilBrown2019-03-211-20/+20
| | | | | | | | | | | | | | | | | | The pattern set by list.h is that for_each..continue() iterators start at the next entry after the given one, while for_each..from() iterators start at the given entry. The rht_for_each*continue() iterators are documented as though the start at the 'next' entry, but actually start at the given entry, and they are used expecting that behaviour. So fix the documentation and change the names to *from for consistency with list.h Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* rhashtable: don't hold lock on first table throughout insertion.NeilBrown2019-03-211-13/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | rhashtable_try_insert() currently holds a lock on the bucket in the first table, while also locking buckets in subsequent tables. This is unnecessary and looks like a hold-over from some earlier version of the implementation. As insert and remove always lock a bucket in each table in turn, and as insert only inserts in the final table, there cannot be any races that are not covered by simply locking a bucket in each table in turn. When an insert call reaches that last table it can be sure that there is no matchinf entry in any other table as it has searched them all, and insertion never happens anywhere but in the last table. The fact that code tests for the existence of future_tbl while holding a lock on the relevant bucket ensures that two threads inserting the same key will make compatible decisions about which is the "last" table. This simplifies the code and allows the ->rehash field to be discarded. We still need a way to ensure that a dead bucket_table is never re-linked by rhashtable_walk_stop(). This can be achieved by calling call_rcu() inside the locked region, and checking with rcu_head_after_call_rcu() in rhashtable_walk_stop() to see if the bucket table is empty and dead. Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Reviewed-by: Paul E. McKenney <paulmck@linux.ibm.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: dst: remove gc leftoversJulian Wiedmann2019-03-211-11/+0
| | | | | | | | | | Get rid of some obsolete gc-related documentation and macros that were missed in commit 5b7c9a8ff828 ("net: remove dst gc related code"). CC: Wei Wang <weiwan@google.com> Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Acked-by: Wei Wang <weiwan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Allow amount of dirty memory from fib resizing to be controllableDavid Ahern2019-03-211-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | fib_trie implementation calls synchronize_rcu when a certain amount of pages are dirty from freed entries. The number of pages was determined experimentally in 2009 (commit c3059477fce2d). At the current setting, synchronize_rcu is called often -- 51 times in a second in one test with an average of an 8 msec delay adding a fib entry. The total impact is a lot of slow down modifying the fib. This is seen in the output of 'time' - the difference between real time and sys+user. For example, using 720,022 single path routes and 'ip -batch'[1]: $ time ./ip -batch ipv4/routes-1-hops real 0m14.214s user 0m2.513s sys 0m6.783s So roughly 35% of the actual time to install the routes is from the ip command getting scheduled out, most notably due to synchronize_rcu (this is observed using 'perf sched timehist'). This patch makes the amount of dirty memory configurable between 64k where the synchronize_rcu is called often (small, low end systems that are memory sensitive) to 64M where synchronize_rcu is called rarely during a large FIB change (for high end systems with lots of memory). The default is 512kB which corresponds to the current setting of 128 pages with a 4kB page size. As an example, at 16MB the worst interval shows 4 calls to synchronize_rcu in a second blocking for up to 30 msec in a single instance, and a total of almost 100 msec across the 4 calls in the second. The trade off is allowing FIB entries to consume more memory in a given time window but but with much better fib insertion rates (~30% increase in prefixes/sec). With this patch and net.ipv4.fib_sync_mem set to 16MB, the same batch file runs in: $ time ./ip -batch ipv4/routes-1-hops real 0m9.692s user 0m2.491s sys 0m6.769s So the dead time is reduced to about 1/2 second or <5% of the real time. [1] 'ip' modified to not request ACK messages which improves route insertion times by about 20% Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tun: Add ioctl() TUNGETDEVNETNS cmd to allow obtaining real net ns of tun deviceKirill Tkhai2019-03-211-0/+1
| | | | | | | | | | | | | | | | | | In commit f2780d6d7475 "tun: Add ioctl() SIOCGSKNS cmd to allow obtaining net ns of tun device" it was missed that tun may change its net ns, while net ns of socket remains the same as it was created initially. SIOCGSKNS returns net ns of socket, so it is not suitable for obtaining net ns of device. We may have two tun devices with the same names in two net ns, and in this case it's not possible to determ, which of them fd refers to (TUNGETIFF will return the same name). This patch adds new ioctl() cmd for obtaining net ns of a device. Reported-by: Harald Albrecht <harald.albrecht@gmx.net> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: Change addrconf_f6i_alloc to use ip6_route_info_createDavid Ahern2019-03-211-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | Change addrconf_f6i_alloc to generate a fib6_config and call ip6_route_info_create. addrconf_f6i_alloc is the last caller to fib6_info_alloc besides ip6_route_info_create, and there is no reason for it to do its own initialization on a fib6_info. Host routes need to be created even if the device is down, so add a new flag, fc_ignore_dev_down, to fib6_config and update fib6_nh_init to not error out if device is not up. Notes on the conversion: - ip_fib_metrics_init is the same as fib6_config has fc_mx set to NULL and fc_mx_len set to 0 - dst_nocount is handled by the RTF_ADDRCONF flag - dst_host is handled by fc_dst_len = 128 nh_gw does not get set after the conversion to ip6_route_info_create but it should not be set in addrconf_f6i_alloc since this is a host route not a gateway route. Everything else is a straight forward map between fib6_info and fib6_config. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: Add icmp_echo_ignore_anycast for ICMPv6Stephen Suryaputra2019-03-211-0/+1
| | | | | | | | In addition to icmp_echo_ignore_multicast, there is a need to also prevent responding to pings to anycast addresses for security. Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: remove 'fallback' argument from dev->ndo_select_queue()Paolo Abeni2019-03-201-8/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | After the previous patch, all the callers of ndo_select_queue() provide as a 'fallback' argument netdev_pick_tx. The only exceptions are nested calls to ndo_select_queue(), which pass down the 'fallback' available in the current scope - still netdev_pick_tx. We can drop such argument and replace fallback() invocation with netdev_pick_tx(). This avoids an indirect call per xmit packet in some scenarios (TCP syn, UDP unconnected, XDP generic, pktgen) with device drivers implementing such ndo. It also clean the code a bit. Tested with ixgbe and CONFIG_FCOE=m With pktgen using queue xmit: threads vanilla patched (kpps) (kpps) 1 2334 2428 2 4166 4278 4 7895 8100 v1 -> v2: - rebased after helper's name change Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* packet: rework packet_pick_tx_queue() to use common code selectionPaolo Abeni2019-03-201-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | Currently packet_pick_tx_queue() is the only caller of ndo_select_queue() using a fallback argument other than netdev_pick_tx. Leveraging rx queue, we can obtain a similar queue selection behavior using core helpers. After this change, ndo_select_queue() is always invoked with netdev_pick_tx() as fallback. We can change ndo_select_queue() signature in a followup patch, dropping an indirect call per transmitted packet in some scenarios (e.g. TCP syn and XDP generic xmit) This changes slightly how af packet queue selection happens when PACKET_QDISC_BYPASS is set. It's now more similar to plan dev_queue_xmit() tacking in account both XPS and TC mapping. v1 -> v2: - rebased after helper name change RFC -> v1: - initialize sender_cpu to the expected value Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: dev: rename queue selection helpers.Paolo Abeni2019-03-201-3/+3
| | | | | | | | | | | | | | With the following patches, we are going to use __netdev_pick_tx() in many modules. Rename it to netdev_pick_tx(), to make it clear is a public API. Also rename the existing netdev_pick_tx() to netdev_core_pick_tx(), to avoid name clashes. Suggested-by: Eric Dumazet <edumazet@google.com> Suggested-by: David Miller <davem@davemloft.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/tls: Add support of AES128-CCM based ciphersVakul Garg2019-03-202-2/+28
| | | | | | | | | | | | | | Added support for AES128-CCM based record encryption. AES128-CCM is similar to AES128-GCM. Both of them have same salt/iv/mac size. The notable difference between the two is that while invoking AES128-CCM operation, the salt||nonce (which is passed as IV) has to be prefixed with a hardcoded value '2'. Further, CCM implementation in kernel requires IV passed in crypto_aead_request() to be full '16' bytes. Therefore, the record structure 'struct tls_rec' has been modified to reserve '16' bytes for IV. This works for both GCM and CCM based cipher. Signed-off-by: Vakul Garg <vakul.garg@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: Add icmp_echo_ignore_multicast support for ICMPv6Stephen Suryaputra2019-03-191-0/+1
| | | | | | | | | | | IPv4 has icmp_echo_ignore_broadcast to prevent responding to broadcast pings. IPv6 needs a similar mechanism. v1->v2: - Remove NET_IPV6_ICMP_ECHO_IGNORE_MULTICAST. Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: free request sock directly upon TFO or syncookies errorGuillaume Nault2019-03-191-3/+7
| | | | | | | | | | | | | | | | | | | | | Since the request socket is created locally, it'd make more sense to use reqsk_free() instead of reqsk_put() in TFO and syncookies' error path. However, tcp_get_cookie_sock() may set ->rsk_refcnt before freeing the socket; tcp_conn_request() may also have non-null ->rsk_refcnt because of tcp_try_fastopen(). In both cases 'req' hasn't been exposed to the outside world and is safe to free immediately, but that'd trigger the WARN_ON_ONCE in reqsk_free(). Define __reqsk_free() for these situations where we know nobody's referencing the socket, even though ->rsk_refcnt might be non-null. Now we can consolidate the error path of tcp_get_cookie_sock() and tcp_conn_request(). Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tipc: support broadcast/replicast configurable for bc-linkHoang Le2019-03-191-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, a multicast stream uses either broadcast or replicast as transmission method, based on the ratio between number of actual destinations nodes and cluster size. However, when an L2 interface (e.g., VXLAN) provides pseudo broadcast support, this becomes very inefficient, as it blindly replicates multicast packets to all cluster/subnet nodes, irrespective of whether they host actual target sockets or not. The TIPC multicast algorithm is able to distinguish real destination nodes from other nodes, and hence provides a smarter and more efficient method for transferring multicast messages than pseudo broadcast can do. Because of this, we now make it possible for users to force the broadcast link to permanently switch to using replicast, irrespective of which capabilities the bearer provides, or pretend to provide. Conversely, we also make it possible to force the broadcast link to always use true broadcast. While maybe less useful in deployed systems, this may at least be useful for testing the broadcast algorithm in small clusters. We retain the current AUTOSELECT ability, i.e., to let the broadcast link automatically select which algorithm to use, and to switch back and forth between broadcast and replicast as the ratio between destination node number and cluster size changes. This remains the default method. Furthermore, we make it possible to configure the threshold ratio for such switches. The default ratio is now set to 10%, down from 25% in the earlier implementation. Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
* sctp: get sctphdr by offset in sctp_compute_cksumXin Long2019-03-191-1/+1
| | | | | | | | | | | | | | | | | | sctp_hdr(skb) only works when skb->transport_header is set properly. But in Netfilter, skb->transport_header for ipv6 is not guaranteed to be right value for sctphdr. It would cause to fail to check the checksum for sctp packets. So fix it by using offset, which is always right in all places. v1->v2: - Fix the changelog. Fixes: e6d8b64b34aa ("net: sctp: fix and consolidate SCTP checksumming code") Reported-by: Li Shuang <shuali@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* packets: Always register packet sk in the same orderMaxime Chevallier2019-03-191-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When using fanouts with AF_PACKET, the demux functions such as fanout_demux_cpu will return an index in the fanout socket array, which corresponds to the selected socket. The ordering of this array depends on the order the sockets were added to a given fanout group, so for FANOUT_CPU this means sockets are bound to cpus in the order they are configured, which is OK. However, when stopping then restarting the interface these sockets are bound to, the sockets are reassigned to the fanout group in the reverse order, due to the fact that they were inserted at the head of the interface's AF_PACKET socket list. This means that traffic that was directed to the first socket in the fanout group is now directed to the last one after an interface restart. In the case of FANOUT_CPU, traffic from CPU0 will be directed to the socket that used to receive traffic from the last CPU after an interface restart. This commit introduces a helper to add a socket at the tail of a list, then uses it to register AF_PACKET sockets. Note that this changes the order in which sockets are listed in /proc and with sock_diag. Fixes: dc99f600698d ("packet: Add fanout support") Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller2019-03-164-63/+167
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Daniel Borkmann says: ==================== pull-request: bpf 2019-03-16 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) Fix a umem memory leak on cleanup in AF_XDP, from Björn. 2) Fix BTF to properly resolve forward-declared enums into their corresponding full enum definition types during deduplication, from Andrii. 3) Fix libbpf to reject invalid flags in xsk_socket__create(), from Magnus. 4) Fix accessing invalid pointer returned from bpf_tcp_sock() and bpf_sk_fullsock() after bpf_sk_release() was called, from Martin. 5) Fix generation of load/store DW instructions in PPC JIT, from Naveen. 6) Various fixes in BPF helper function documentation in bpf.h UAPI header used to bpf-helpers(7) man page, from Quentin. 7) Fix segfault in BPF test_progs when prog loading failed, from Yonghong. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * xsk: fix umem memory leak on cleanupBjörn Töpel2019-03-161-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the umem is cleaned up, the task that created it might already be gone. If the task was gone, the xdp_umem_release function did not free the pages member of struct xdp_umem. It turned out that the task lookup was not needed at all; The code was a left-over when we moved from task accounting to user accounting [1]. This patch fixes the memory leak by removing the task lookup logic completely. [1] https://lore.kernel.org/netdev/20180131135356.19134-3-bjorn.topel@gmail.com/ Link: https://lore.kernel.org/netdev/c1cb2ca8-6a14-3980-8672-f3de0bb38dfd@suse.cz/ Fixes: c0c77d8fb787 ("xsk: add user memory registration support sockopt") Reported-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * bpf: add documentation for helpers bpf_spin_lock(), bpf_spin_unlock()Quentin Monnet2019-03-141-0/+55
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add documentation for the BPF spinlock-related helpers to the doc in bpf.h. I added the constraints and restrictions coming with the use of spinlocks for BPF: not all of it is directly related to the use of the helper, but I thought it would be nice for users to find them in the man page. This list of restrictions is nearly a verbatim copy of the list in Alexei's commit log for those helpers. Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: fix documentation for eBPF helpersQuentin Monnet2019-03-141-63/+65
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Another round of minor fixes for the documentation of the BPF helpers located in the UAPI bpf.h header file. Changes include: - Moving around description of some helpers, to keep the descriptions in the same order as helpers are declared (bpf_map_push_elem(), leftover from commit 90b1023f68c7 ("bpf: fix documentation for eBPF helpers"), bpf_rc_keydown(), and bpf_skb_ancestor_cgroup_id()). - Fixing typos ("contex" -> "context"). - Harmonising return types ("void* " -> "void *", "uint64_t" -> "u64"). - Addition of the "bpf_" prefix to bpf_get_storage(). - Light additions of RST markup on some keywords. - Empty line deletion between description and return value for bpf_tcp_sock(). - Edit for the description for bpf_skb_ecn_set_ce() (capital letters, acronym expansion, no effect if ECT not set, more details on return value). Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: Add bpf_get_listener_sock(struct bpf_sock *sk) helperMartin KaFai Lau2019-03-131-1/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a new helper "struct bpf_sock *bpf_get_listener_sock(struct bpf_sock *sk)" which returns a bpf_sock in TCP_LISTEN state. It will trace back to the listener sk from a request_sock if possible. It returns NULL for all other cases. No reference is taken because the helper ensures the sk is in SOCK_RCU_FREE (where the TCP_LISTEN sock should be in). Hence, bpf_sk_release() is unnecessary and the verifier does not allow bpf_sk_release(listen_sk) to be called either. The following is also allowed because the bpf_prog is run under rcu_read_lock(): sk = bpf_sk_lookup_tcp(); /* if (!sk) { ... } */ listen_sk = bpf_get_listener_sock(sk); /* if (!listen_sk) { ... } */ bpf_sk_release(sk); src_port = listen_sk->src_port; /* Allowed */ Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: Fix bpf_tcp_sock and bpf_sk_fullsock issue related to bpf_sk_releaseMartin KaFai Lau2019-03-132-1/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Lorenz Bauer [thanks!] reported that a ptr returned by bpf_tcp_sock(sk) can still be accessed after bpf_sk_release(sk). Both bpf_tcp_sock() and bpf_sk_fullsock() have the same issue. This patch addresses them together. A simple reproducer looks like this: sk = bpf_sk_lookup_tcp(); /* if (!sk) ... */ tp = bpf_tcp_sock(sk); /* if (!tp) ... */ bpf_sk_release(sk); snd_cwnd = tp->snd_cwnd; /* oops! The verifier does not complain. */ The problem is the verifier did not scrub the register's states of the tcp_sock ptr (tp) after bpf_sk_release(sk). [ Note that when calling bpf_tcp_sock(sk), the sk is not always refcount-acquired. e.g. bpf_tcp_sock(skb->sk). The verifier works fine for this case. ] Currently, the verifier does not track if a helper's return ptr (in REG_0) is "carry"-ing one of its argument's refcount status. To carry this info, the reg1->id needs to be stored in reg0. One approach was tried, like "reg0->id = reg1->id", when calling "bpf_tcp_sock()". The main idea was to avoid adding another "ref_obj_id" for the same reg. However, overlapping the NULL marking and ref tracking purpose in one "id" does not work well: ref_sk = bpf_sk_lookup_tcp(); fullsock = bpf_sk_fullsock(ref_sk); tp = bpf_tcp_sock(ref_sk); if (!fullsock) { bpf_sk_release(ref_sk); return 0; } /* fullsock_reg->id is marked for NOT-NULL. * Same for tp_reg->id because they have the same id. */ /* oops. verifier did not complain about the missing !tp check */ snd_cwnd = tp->snd_cwnd; Hence, a new "ref_obj_id" is needed in "struct bpf_reg_state". With a new ref_obj_id, when bpf_sk_release(sk) is called, the verifier can scrub all reg states which has a ref_obj_id match. It is done with the changes in release_reg_references() in this patch. While fixing it, sk_to_full_sk() is removed from bpf_tcp_sock() and bpf_sk_fullsock() to avoid these helpers from returning another ptr. It will make bpf_sk_release(tp) possible: sk = bpf_sk_lookup_tcp(); /* if (!sk) ... */ tp = bpf_tcp_sock(sk); /* if (!tp) ... */ bpf_sk_release(tp); A separate helper "bpf_get_listener_sock()" will be added in a later patch to do sk_to_full_sk(). Misc change notes: - To allow bpf_sk_release(tp), the arg of bpf_sk_release() is changed from ARG_PTR_TO_SOCKET to ARG_PTR_TO_SOCK_COMMON. ARG_PTR_TO_SOCKET is removed from bpf.h since no helper is using it. - arg_type_is_refcounted() is renamed to arg_type_may_be_refcounted() because ARG_PTR_TO_SOCK_COMMON is the only one and skb->sk is not refcounted. All bpf_sk_release(), bpf_sk_fullsock() and bpf_tcp_sock() take ARG_PTR_TO_SOCK_COMMON. - check_refcount_ok() ensures is_acquire_function() cannot take arg_type_may_be_refcounted() as its argument. - The check_func_arg() can only allow one refcount-ed arg. It is guaranteed by check_refcount_ok() which ensures at most one arg can be refcounted. Hence, it is a verifier internal error if >1 refcount arg found in check_func_arg(). - In release_reference(), release_reference_state() is called first to ensure a match on "reg->ref_obj_id" can be found before scrubbing the reg states with release_reg_references(). - reg_is_refcounted() is no longer needed. 1. In mark_ptr_or_null_regs(), its usage is replaced by "ref_obj_id && ref_obj_id == id" because, when is_null == true, release_reference_state() should only be called on the ref_obj_id obtained by a acquire helper (i.e. is_acquire_function() == true). Otherwise, the following would happen: sk = bpf_sk_lookup_tcp(); /* if (!sk) { ... } */ fullsock = bpf_sk_fullsock(sk); if (!fullsock) { /* * release_reference_state(fullsock_reg->ref_obj_id) * where fullsock_reg->ref_obj_id == sk_reg->ref_obj_id. * * Hence, the following bpf_sk_release(sk) will fail * because the ref state has already been released in the * earlier release_reference_state(fullsock_reg->ref_obj_id). */ bpf_sk_release(sk); } 2. In release_reg_references(), the current reg_is_refcounted() call is unnecessary because the id check is enough. - The type_is_refcounted() and type_is_refcounted_or_null() are no longer needed also because reg_is_refcounted() is removed. Fixes: 655a51e536c0 ("bpf: Add struct bpf_tcp_sock and BPF_FUNC_tcp_sock") Reported-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* | net: add documentation to socket.cPedro Tammela2019-03-152-6/+12
| | | | | | | | | | | | | | | | | | | | | | Adds missing sphinx documentation to the socket.c's functions. Also fixes some whitespaces. I also changed the style of older documentation as an effort to have an uniform documentation style. Signed-off-by: Pedro Tammela <pctammela@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | appletalk: Fix potential NULL pointer dereference in unregister_snap_clientYueHaibing2019-03-151-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | register_snap_client may return NULL, all the callers check it, but only print a warning. This will result in NULL pointer dereference in unregister_snap_client and other places. It has always been used like this since v2.6 Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge branch 'akpm' (patches from Andrew)Linus Torvalds2019-03-141-2/+2
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge misc patches from Andrew Morton: - a little bit more MM - a few fixups [ The "little bit more MM" is actually just one of the three patches Andrew sent for mm/filemap.c, I'm still mulling over two more of them from Josef Bacik - Linus ] * emailed patches from Andrew Morton <akpm@linux-foundation.org>: include/linux/swap.h: use offsetof() instead of custom __swapoffset macro tools/testing/selftests/proc/proc-pid-vm.c: test with vsyscall in mind zram: default to lzo-rle instead of lzo filemap: pass vm_fault to the mmap ra helpers
| * | include/linux/swap.h: use offsetof() instead of custom __swapoffset macroPi-Hsun Shih2019-03-141-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use offsetof() to calculate offset of a field to take advantage of compiler built-in version when possible, and avoid UBSAN warning when compiling with Clang: UBSAN: Undefined behaviour in mm/swapfile.c:3010:38 member access within null pointer of type 'union swap_header' CPU: 6 PID: 1833 Comm: swapon Tainted: G S 4.19.23 #43 Call trace: dump_backtrace+0x0/0x194 show_stack+0x20/0x2c __dump_stack+0x20/0x28 dump_stack+0x70/0x94 ubsan_epilogue+0x14/0x44 ubsan_type_mismatch_common+0xf4/0xfc __ubsan_handle_type_mismatch_v1+0x34/0x54 __se_sys_swapon+0x654/0x1084 __arm64_sys_swapon+0x1c/0x24 el0_svc_common+0xa8/0x150 el0_svc_compat_handler+0x2c/0x38 el0_svc_compat+0x8/0x18 Link: http://lkml.kernel.org/r/20190312081902.223764-1-pihsun@chromium.org Signed-off-by: Pi-Hsun Shih <pihsun@chromium.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | Merge tag 'pm-5.1-rc1-2' of ↵Linus Torvalds2019-03-142-10/+0
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull more power management updates from Rafael Wysocki: "These are mostly fixes and cleanups on top of the previously merged power management material for 5.1-rc1 with one cpupower utility update that wasn't pushed earlier due to unfortunate timing. Specifics: - Fix registration of new cpuidle governors partially broken during the 5.0 development cycle by mistake (Rafael Wysocki). - Avoid integer overflows in the menu cpuidle governor by making it discard the overflowing data points upfront (Rafael Wysocki). - Fix minor mistake in the recent update of the iowait boost computation in the intel_pstate driver (Rafael Wysocki). - Drop incorrect __init annotation from one function in the pxa2xx cpufreq driver (Arnd Bergmann). - Fix the operating performance points (OPP) framework initialization for devices in multiple power domains if only one of them is scalable (Rajendra Nayak). - Fix mistake in dev_pm_opp_set_rate() which causes it to skip updating the performance state if the new frequency is the same as the old one (Viresh Kumar). - Rework the cancellation of wakeup source timers to avoid potential issues with it and do some cleanups unlocked by that change (Viresh Kumar, Rafael Wysocki). - Clean up the code computing the active/suspended time of devices in the PM-runtime framework after recent changes (Ulf Hansson). - Make the power management infrastructure code use pr_fmt() consistently (Joe Perches). - Clean up the generic power domains (genpd) framework somewhat (Aisheng Dong). - Improve kerneldoc comments for two functions in the cpufreq core (Rafael Wysocki). - Fix typo in a PM QoS file description comment (Aisheng Dong). - Update the handling of CPU boost frequencies in the cpupower utility (Abhishek Goel)" * tag 'pm-5.1-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: cpuidle: governor: Add new governors to cpuidle_governors again cpufreq: intel_pstate: Fix up iowait_boost computation PM / OPP: Update performance state when freq == old_freq PM / wakeup: Drop wakeup_source_drop() PM / wakeup: Rework wakeup source timer cancellation PM / domains: Remove one unnecessary blank line PM / Domains: Return early for all errors in _genpd_power_off() PM / Domains: Improve warn for multiple states but no governor OPP: Fix handling of multiple power domains PM / QoS: Fix typo in file description cpufreq: pxa2xx: remove incorrect __init annotation PM-runtime: Call pm_runtime_active|suspended_time() from sysfs PM-runtime: Consolidate code to get active/suspended time PM: Add and use pr_fmt() cpufreq: Improve kerneldoc comments for cpufreq_cpu_get/put() cpuidle: menu: Avoid overflows when computing variance tools/power/cpupower: Display boost frequency separately
| | \ \
| | \ \
| *-. \ \ Merge branches 'pm-core', 'pm-sleep' and 'pm-qos'Rafael J. Wysocki2019-03-142-10/+0
| |\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * pm-core: PM-runtime: Call pm_runtime_active|suspended_time() from sysfs PM-runtime: Consolidate code to get active/suspended time * pm-sleep: PM / wakeup: Drop wakeup_source_drop() PM / wakeup: Rework wakeup source timer cancellation * pm-qos: PM / QoS: Fix typo in file description
| | | * | | PM / wakeup: Drop wakeup_source_drop()Rafael J. Wysocki2019-03-121-9/+0
| | |/ / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After commit d856f39ac1cc ("PM / wakeup: Rework wakeup source timer cancellation") wakeup_source_drop() is a trivial wrapper around __pm_relax() and it has no users except for wakeup_source_destroy() and wakeup_source_trash() which also has no users, so drop it along with the latter and make wakeup_source_destroy() call __pm_relax() directly. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
| | * | | PM-runtime: Call pm_runtime_active|suspended_time() from sysfsUlf Hansson2019-03-071-1/+0
| |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Avoid the open-coding of the accounted time acquisition in runtime_active|suspend_time_show() and make them call pm_runtime_active|suspended_time() instead. Note that this change also indirectly avoids holding dev->power.lock around the do_div() computation and the sprintf() call which is an additional improvement. Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> [ rjw: Changelog ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2019-03-142-5/+9
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking fixes from David Miller: "More fixes in the queue: 1) Netfilter nat can erroneously register the device notifier twice, fix from Florian Westphal. 2) Use after free in nf_tables, from Pablo Neira Ayuso. 3) Parallel update of steering rule fix in mlx5 river, from Eli Britstein. 4) RX processing panic in lan743x, fix from Bryan Whitehead. 5) Use before initialization of TCP_SKB_CB, fix from Christoph Paasch. 6) Fix locking in SRIOV mode of mlx4 driver, from Jack Morgenstein. 7) Fix TX stalls in lan743x due to mishandling of interrupt ACKing modes, from Bryan Whitehead. 8) Fix infoleak in l2tp_ip6_recvmsg(), from Eric Dumazet" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (43 commits) pptp: dst_release sk_dst_cache in pptp_sock_destruct MAINTAINERS: GENET & SYSTEMPORT: Add internal Broadcom list l2tp: fix infoleak in l2tp_ip6_recvmsg() net/tls: Inform user space about send buffer availability net_sched: return correct value for *notify* functions lan743x: Fix TX Stall Issue net/mlx4_core: Fix qp mtt size calculation net/mlx4_core: Fix locking in SRIOV mode when switching between events and polling net/mlx4_core: Fix reset flow when in command polling mode mlxsw: minimal: Initialize base_mac mlxsw: core: Prevent duplication during QSFP module initialization net: dwmac-sun8i: fix a missing check of of_get_phy_mode net: sh_eth: fix a missing check of of_get_phy_mode net: 8390: fix potential NULL pointer dereferences net: fujitsu: fix a potential NULL pointer dereference net: qlogic: fix a potential NULL pointer dereference isdn: hfcpci: fix potential NULL pointer dereference Documentation: devicetree: add a new optional property for port mac address net: rocker: fix a potential NULL pointer dereference net: qlge: fix a potential NULL pointer dereference ...
| * \ \ \ Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller2019-03-121-4/+8
| |\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for your net tree: 1) Fix list corruption in device notifier in the masquerade infrastructure, from Florian Westphal. 2) Fix double-free of sets and use-after-free when deleting elements. 3) Don't bogusly return EBUSY when removing a set after flush command. 4) Use-after-free in dynamically allocate operations. 5) Don't report a new ruleset generation to userspace if transaction list is empty, this invalidates the userspace cache innecessarily. From Florian Westphal. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | | netfilter: nf_tables: bogus EBUSY when deleting set after flushPablo Neira Ayuso2019-03-111-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Set deletion after flush coming in the same batch results in EBUSY. Add set use counter to track the number of references to this set from rules. We cannot rely on the list of bindings for this since such list is still populated from the preparation phase. Reported-by: Václav Zindulka <vaclav.zindulka@tlapnet.cz> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * | | | netfilter: nf_tables: fix set double-free in abort pathPablo Neira Ayuso2019-03-081-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The abort path can cause a double-free of an anonymous set. Added-and-to-be-aborted rule looks like this: udp dport { 137, 138 } drop The to-be-aborted transaction list looks like this: newset newsetelem newsetelem rule This gets walked in reverse order, so first pass disables the rule, the set elements, then the set. After synchronize_rcu(), we then destroy those in same order: rule, set element, set element, newset. Problem is that the anonymous set has already been bound to the rule, so the rule (lookup expression destructor) already frees the set, when then cause use-after-free when trying to delete the elements from this set, then try to free the set again when handling the newset expression. Rule releases the bound set in first place from the abort path, this causes the use-after-free on set element removal when undoing the new element transactions. To handle this, skip new element transaction if set is bound from the abort path. This is still causes the use-after-free on set element removal. To handle this, remove transaction from the list when the set is already bound. Joint work with Florian Westphal. Fixes: f6ac85858976 ("netfilter: nf_tables: unbind set in rule from commit path") Bugzilla: https://bugzilla.netfilter.org/show_bug.cgi?id=1325 Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | | | | y2038: fix socket.h header inclusionArnd Bergmann2019-03-111-1/+1
| | |_|_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Referencing the __kernel_long_t type caused some user space applications to stop compiling when they had not already included linux/posix_types.h, e.g. s/multicast.c -o ext/sockets/multicast.lo In file included from /builddir/build/BUILD/php-7.3.3/main/php.h:468, from /builddir/build/BUILD/php-7.3.3/ext/sockets/sockets.c:27: /builddir/build/BUILD/php-7.3.3/ext/sockets/sockets.c: In function 'zm_startup_sockets': /builddir/build/BUILD/php-7.3.3/ext/sockets/sockets.c:776:40: error: '__kernel_long_t' undeclared (first use in this function) 776 | REGISTER_LONG_CONSTANT("SO_SNDTIMEO", SO_SNDTIMEO, CONST_CS | CONST_PERSISTENT); It is safe to include that header here, since it only contains kernel internal types that do not conflict with other user space types. It's still possible that some related build failures remain, but those are likely to be for code that is not already y2038 safe. Reported-by: Laura Abbott <labbott@redhat.com> Fixes: a9beb86ae6e5 ("sock: Add SO_RCVTIMEO_NEW and SO_SNDTIMEO_NEW") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | Merge tag 'dmaengine-5.1-rc1' of git://git.infradead.org/users/vkoul/slave-dmaLinus Torvalds2019-03-144-15/+68
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull dmaengine updates from Vinod Koul: - dmatest updates for modularizing common struct and code - remove SG support for VDMA xilinx IP and updates to driver - Update to dw driver to support Intel iDMA controllers multi-block support - tegra updates for proper reporting of residue - Add Snow Ridge ioatdma device id and support for IOATDMA v3.4 - struct_size() usage and useless LIST_HEAD cleanups in subsystem. - qDMA controller driver for Layerscape SoCs - stm32-dma PM Runtime support - And usual updates to imx-sdma, sprd, Documentation, fsl-edma, bcm2835, qcom_hidma etc * tag 'dmaengine-5.1-rc1' of git://git.infradead.org/users/vkoul/slave-dma: (81 commits) dmaengine: imx-sdma: fix consistent dma test failures dmaengine: imx-sdma: add a test for imx8mq multi sdma devices dmaengine: imx-sdma: add clock ratio 1:1 check dmaengine: dmatest: move test data alloc & free into functions dmaengine: dmatest: add short-hand `buf_size` var in dmatest_func() dmaengine: dmatest: wrap src & dst data into a struct dmaengine: ioatdma: support latency tolerance report (LTR) for v3.4 dmaengine: ioatdma: add descriptor pre-fetch support for v3.4 dmaengine: ioatdma: disable DCA enabling on IOATDMA v3.4 dmaengine: ioatdma: Add Snow Ridge ioatdma device id dmaengine: sprd: Change channel id to slave id for DMA cell specifier dt-bindings: dmaengine: sprd: Change channel id to slave id for DMA cell specifier dmaengine: mv_xor: Use correct device for DMA API Documentation :dmaengine: clarify DMA desc. pointer after submission Documentation: dmaengine: fix dmatest.rst warning dmaengine: k3dma: Add support for dma-channel-mask dmaengine: k3dma: Delete axi_config dmaengine: k3dma: Upgrade k3dma driver to support hisi_asp_dma hardware Documentation: bindings: dma: Add binding for dma-channel-mask Documentation: bindings: k3dma: Extend the k3dma driver binding to support hisi-asp ...
| * \ \ \ \ Merge branch 'topic/tegra' into for-linusVinod Koul2019-03-121-0/+61
| |\ \ \ \ \
| | * | | | | dmaengine: tegra: add tracepoints to driverBen Dooks2019-01-071-0/+61
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add some trace-points to the driver to allow for debuging via the trace pipe. Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Vinod Koul <vkoul@kernel.org>