summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* mm: hugetlb: yield when prepping struct pagesCannon Matthews2018-07-041-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When booting with very large numbers of gigantic (i.e. 1G) pages, the operations in the loop of gather_bootmem_prealloc, and specifically prep_compound_gigantic_page, takes a very long time, and can cause a softlockup if enough pages are requested at boot. For example booting with 3844 1G pages requires prepping (set_compound_head, init the count) over 1 billion 4K tail pages, which takes considerable time. Add a cond_resched() to the outer loop in gather_bootmem_prealloc() to prevent this lockup. Tested: Booted with softlockup_panic=1 hugepagesz=1G hugepages=3844 and no softlockup is reported, and the hugepages are reported as successfully setup. Link: http://lkml.kernel.org/r/20180627214447.260804-1-cannonmatthews@google.com Signed-off-by: Cannon Matthews <cannonmatthews@google.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* userfaultfd: hugetlbfs: fix userfaultfd_huge_must_wait() pte accessJanosch Frank2018-07-041-5/+7
| | | | | | | | | | | | | | | | | Use huge_ptep_get() to translate huge ptes to normal ptes so we can check them with the huge_pte_* functions. Otherwise some architectures will check the wrong values and will not wait for userspace to bring in the memory. Link: http://lkml.kernel.org/r/20180626132421.78084-1-frankja@linux.ibm.com Fixes: 369cd2121be4 ("userfaultfd: hugetlbfs: userfaultfd_huge_must_wait for hugepmd ranges") Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* net/smc: fix up merge error with poll changesLinus Torvalds2018-07-031-1/+2
| | | | | | | | | | | | | | | | | | | | My networking merge (commit 4e33d7d47943: "Pull networking fixes from David Miller") got the poll() handling conflict wrong for af_smc. The conflict between my a11e1d432b51 ("Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL") and Ursula Braun's 24ac3a08e658 ("net/smc: rebuild nonblocking connect") should have left the call to sock_poll_wait() in place, just without the socket lock release/retake. And I really should have realized that. But happily, I at least asked Ursula to double-check the merge, and she set me right. This also fixes an incidental whitespace issue nearby that annoyed me while looking at this. Pointed-out-by: Ursula Braun <ubraun@linux.ibm.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/mdLinus Torvalds2018-07-022-3/+12
|\ | | | | | | | | | | | | | | | | | | | | | | | | Pull MD fixes from Shaohua Li: "Two small fixes for MD: - an error handling fix from me - a recover bug fix for raid10 from BingJing" * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: md/raid10: fix that replacement cannot complete recovery after reassemble MD: cleanup resources in failure
| * md/raid10: fix that replacement cannot complete recovery after reassembleBingJing Chang2018-06-281-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During assemble, the spare marked for replacement is not checked. conf->fullsync cannot be updated to be 1. As a result, recovery will treat it as a clean array. All recovering sectors are skipped. Original device is replaced with the not-recovered spare. mdadm -C /dev/md0 -l10 -n4 -pn2 /dev/loop[0123] mdadm /dev/md0 -a /dev/loop4 mdadm /dev/md0 --replace /dev/loop0 mdadm -S /dev/md0 # stop array during recovery mdadm -A /dev/md0 /dev/loop[01234] After reassemble, you can see recovery go on, but it completes immediately. In fact, recovery is not actually processed. To solve this problem, we just add the missing logics for replacment spares. (In raid1.c or raid5.c, they have already been checked.) Reported-by: Alex Chen <alexchen@synology.com> Reviewed-by: Alex Wu <alexwu@synology.com> Reviewed-by: Chung-Chiang Cheng <cccheng@synology.com> Signed-off-by: BingJing Chang <bingjingc@synology.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * MD: cleanup resources in failureShaohua Li2018-06-181-3/+5
| | | | | | | | | | | | We need destroy the memory pool in failure Signed-off-by: Shaohua Li <shli@fb.com>
* | Merge tag 'for-linus' of git://github.com/stffrdhrn/linuxLinus Torvalds2018-07-024-12/+13
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull OpenRISC fixes from Stafford Horne: "Two fixes for issues which were breaking OpenRISC boot: - Fix bug in __pte_free_tlb() exposed in 4.18 by Matthew Wilcox's page table flag addition. - Fix issue booting on real hardware if delay slot detection emulation is disabled" * tag 'for-linus' of git://github.com/stffrdhrn/linux: openrisc: entry: Fix delay slot exception detection openrisc: Call destructor during __pte_free_tlb
| * | openrisc: entry: Fix delay slot exception detectionStafford Horne2018-07-013-11/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Originally in patch e6d20c55a4 ("openrisc: entry: Fix delay slot detection") I fixed delay slot detection, but only for QEMU. We missed that hardware delay slot detection using delay slot exception flag (DSX) was still broken. This was because QEMU set the DSX flag in both pre-exception supervision register (ESR) and supervision register (SR) register, but on real hardware the DSX flag is only set on the SR register during exceptions. Fix this by carrying the DSX flag into the SR register during exception. We also update the DSX flag read locations to read the value from the SR register not the pt_regs SR register which represents ESR. The ESR should never have the DSX flag set. In the process I updated/removed a few comments to match the current state. Including removing a comment saying that the DSX detection logic was inefficient and needed to be rewritten. I have tested this on QEMU with a patch ensuring it matches the hardware specification. Link: https://lists.gnu.org/archive/html/qemu-devel/2018-07/msg00000.html Fixes: e6d20c55a4 ("openrisc: entry: Fix delay slot detection") Signed-off-by: Stafford Horne <shorne@gmail.com>
| * | openrisc: Call destructor during __pte_free_tlbStafford Horne2018-06-251-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This fixes an issue uncovered when a recent change to add the "page table" flag was merged. During bootup we see many errors like the following: BUG: Bad page state in process mkdir pfn:00bae page:c1ff15c0 count:0 mapcount:-1024 mapping:00000000 index:0x0 flags: 0x0() raw: 00000000 00000000 00000000 fffffbff 00000000 00000100 00000200 00000000 page dumped because: nonzero mapcount Modules linked in: CPU: 0 PID: 46 Comm: mkdir Tainted: G B 4.17.0-simple-smp-07461-g1d40a5ea01d5-dirty #993 Call trace: [<(ptrval)>] show_stack+0x44/0x54 [<(ptrval)>] dump_stack+0xb0/0xe8 [<(ptrval)>] bad_page+0x138/0x174 [<(ptrval)>] ? cpumask_next+0x24/0x34 [<(ptrval)>] free_pages_check_bad+0x6c/0xd0 [<(ptrval)>] free_pcppages_bulk+0x174/0x42c [<(ptrval)>] free_unref_page_commit.isra.17+0xb8/0xc8 [<(ptrval)>] free_unref_page_list+0x10c/0x190 [<(ptrval)>] ? set_reset_devices+0x0/0x2c [<(ptrval)>] release_pages+0x3a0/0x414 [<(ptrval)>] tlb_flush_mmu_free+0x5c/0x90 [<(ptrval)>] tlb_flush_mmu+0x90/0xa4 [<(ptrval)>] arch_tlb_finish_mmu+0x50/0x94 [<(ptrval)>] tlb_finish_mmu+0x30/0x64 [<(ptrval)>] exit_mmap+0x110/0x1e0 [<(ptrval)>] mmput+0x50/0xf0 [<(ptrval)>] do_exit+0x274/0xa94 [<(ptrval)>] do_group_exit+0x50/0x110 [<(ptrval)>] __wake_up_parent+0x0/0x38 [<(ptrval)>] _syscall_return+0x0/0x4 During the __pte_free_tlb path openrisc fails to call the page destructor which would clear the new bits that were introduced. To fix this we are calling the destructor. It seem openrisc was the only architecture missing this, all other architectures either call the destructor like we are doing here or use pte_free. Note: failing to call the destructor was also messing up the zone stats (and will be cause other problems if you were using SPLIT_PTE_PTLOCKS, which we are not yet). Fixes: 1d40a5ea01d53 ("mm: mark pages in use for page tables") Acked-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Stafford Horne <shorne@gmail.com>
* | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2018-07-02114-624/+1249
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking fixes from David Miller: 1) Verify netlink attributes properly in nf_queue, from Eric Dumazet. 2) Need to bump memory lock rlimit for test_sockmap bpf test, from Yonghong Song. 3) Fix VLAN handling in lan78xx driver, from Dave Stevenson. 4) Fix uninitialized read in nf_log, from Jann Horn. 5) Fix raw command length parsing in mlx5, from Alex Vesker. 6) Cleanup loopback RDS connections upon netns deletion, from Sowmini Varadhan. 7) Fix regressions in FIB rule matching during create, from Jason A. Donenfeld and Roopa Prabhu. 8) Fix mpls ether type detection in nfp, from Pieter Jansen van Vuuren. 9) More bpfilter build fixes/adjustments from Masahiro Yamada. 10) Fix XDP_{TX,REDIRECT} flushing in various drivers, from Jesper Dangaard Brouer. 11) fib_tests.sh file permissions were broken, from Shuah Khan. 12) Make sure BH/preemption is disabled in data path of mac80211, from Denis Kenzior. 13) Don't ignore nla_parse_nested() return values in nl80211, from Johannes berg. 14) Properly account sock objects ot kmemcg, from Shakeel Butt. 15) Adjustments to setting bpf program permissions to read-only, from Daniel Borkmann. 16) TCP Fast Open key endianness was broken, it always took on the host endiannness. Whoops. Explicitly make it little endian. From Yuching Cheng. 17) Fix prefix route setting for link local addresses in ipv6, from David Ahern. 18) Potential Spectre v1 in zatm driver, from Gustavo A. R. Silva. 19) Various bpf sockmap fixes, from John Fastabend. 20) Use after free for GRO with ESP, from Sabrina Dubroca. 21) Passing bogus flags to crypto_alloc_shash() in ipv6 SR code, from Eric Biggers. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (87 commits) qede: Adverstise software timestamp caps when PHC is not available. qed: Fix use of incorrect size in memcpy call. qed: Fix setting of incorrect eswitch mode. qed: Limit msix vectors in kdump kernel to the minimum required count. ipvlan: call dev_change_flags when ipvlan mode is reset ipv6: sr: fix passing wrong flags to crypto_alloc_shash() net: fix use-after-free in GRO with ESP tcp: prevent bogus FRTO undos with non-SACK flows bpf: sockhash, add release routine bpf: sockhash fix omitted bucket lock in sock_close bpf: sockmap, fix smap_list_map_remove when psock is in many maps bpf: sockmap, fix crash when ipv6 sock is added net: fib_rules: bring back rule_exists to match rule during add hv_netvsc: split sub-channel setup into async and sync net: use dev_change_tx_queue_len() for SIOCSIFTXQLEN atm: zatm: Fix potential Spectre v1 s390/qeth: consistently re-enable device features s390/qeth: don't clobber buffer on async TX completion s390/qeth: avoid using is_multicast_ether_addr_64bits on (u8 *)[6] s390/qeth: fix race when setting MAC address ...
| * \ \ Merge branch 'qed-fixes'David S. Miller2018-07-025-9/+38
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Sudarsana Reddy Kalluru says: ==================== qed*: Fix series. The patch series addresses few issues in the qed* drivers. Please consider applying it to 'net' branch. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | qede: Adverstise software timestamp caps when PHC is not available.Sudarsana Reddy Kalluru2018-07-021-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When ptp clock is not available for a PF (e.g., higher PFs in NPAR mode), get-tsinfo() callback should return the software timestamp capabilities instead of returning the error. Fixes: 4c55215c ("qede: Add driver support for PTP") Signed-off-by: Sudarsana Reddy Kalluru <Sudarsana.Kalluru@cavium.com> Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | qed: Fix use of incorrect size in memcpy call.Sudarsana Reddy Kalluru2018-07-021-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use the correct size value while copying chassis/port id values. Fixes: 6ad8c632e ("qed: Add support for query/config dcbx.") Signed-off-by: Sudarsana Reddy Kalluru <Sudarsana.Kalluru@cavium.com> Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | qed: Fix setting of incorrect eswitch mode.Sudarsana Reddy Kalluru2018-07-022-3/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | By default, driver sets the eswitch mode incorrectly as VEB (virtual Ethernet bridging). Need to set VEB eswitch mode only when sriov is enabled, and it should be to set NONE by default. The patch incorporates this change. Fixes: 0fefbfbaa ("qed*: Management firmware - notifications and defaults") Signed-off-by: Sudarsana Reddy Kalluru <Sudarsana.Kalluru@cavium.com> Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | qed: Limit msix vectors in kdump kernel to the minimum required count.Sudarsana Reddy Kalluru2018-07-021-0/+8
| |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Memory size is limited in the kdump kernel environment. Allocation of more msix-vectors (or queues) consumes few tens of MBs of memory, which might lead to the kdump kernel failure. This patch adds changes to limit the number of MSI-X vectors in kdump kernel to minimum required value (i.e., 2 per engine). Fixes: fe56b9e6a ("qed: Add module with basic common support") Signed-off-by: Sudarsana Reddy Kalluru <Sudarsana.Kalluru@cavium.com> Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | ipvlan: call dev_change_flags when ipvlan mode is resetHangbin Liu2018-07-021-8/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After we change the ipvlan mode from l3 to l2, or vice versa, we only reset IFF_NOARP flag, but don't flush the ARP table cache, which will cause eth->h_dest to be equal to eth->h_source in ipvlan_xmit_mode_l2(). Then the message will not come out of host. Here is the reproducer on local host: ip link set eth1 up ip addr add 192.168.1.1/24 dev eth1 ip link add link eth1 ipvlan1 type ipvlan mode l3 ip netns add net1 ip link set ipvlan1 netns net1 ip netns exec net1 ip link set ipvlan1 up ip netns exec net1 ip addr add 192.168.2.1/24 dev ipvlan1 ip route add 192.168.2.0/24 via 192.168.1.2 ping 192.168.2.2 -c 2 ip netns exec net1 ip link set ipvlan1 type ipvlan mode l2 ping 192.168.2.2 -c 2 Add the same configuration on remote host. After we set the mode to l2, we could find that the src/dst MAC addresses are the same on eth1: 21:26:06.648565 00:b7:13:ad:d3:05 > 00:b7:13:ad:d3:05, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 58356, offset 0, flags [DF], proto ICMP (1), length 84) 192.168.2.1 > 192.168.2.2: ICMP echo request, id 22686, seq 1, length 64 Fix this by calling dev_change_flags(), which will call netdevice notifier with flag change info. v2: a) As pointed out by Wang Cong, check return value for dev_change_flags() when change dev flags. b) As suggested by Stefano and Sabrina, move flags setting before l3mdev_ops. So we don't need to redo ipvlan_{, un}register_nf_hook() again in err path. Reported-by: Jianlin Shi <jishi@redhat.com> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Fixes: 2ad7bf3638411 ("ipvlan: Initial check-in of the IPVLAN driver.") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | ipv6: sr: fix passing wrong flags to crypto_alloc_shash()Eric Biggers2018-07-021-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The 'mask' argument to crypto_alloc_shash() uses the CRYPTO_ALG_* flags, not 'gfp_t'. So don't pass GFP_KERNEL to it. Fixes: bf355b8d2c30 ("ipv6: sr: add core files for SR HMAC support") Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | net: fix use-after-free in GRO with ESPSabrina Dubroca2018-07-027-10/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since the addition of GRO for ESP, gro_receive can consume the skb and return -EINPROGRESS. In that case, the lower layer GRO handler cannot touch the skb anymore. Commit 5f114163f2f5 ("net: Add a skb_gro_flush_final helper.") converted some of the gro_receive handlers that can lead to ESP's gro_receive so that they wouldn't access the skb when -EINPROGRESS is returned, but missed other spots, mainly in tunneling protocols. This patch finishes the conversion to using skb_gro_flush_final(), and adds a new helper, skb_gro_flush_final_remcsum(), used in VXLAN and GUE. Fixes: 5f114163f2f5 ("net: Add a skb_gro_flush_final helper.") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | tcp: prevent bogus FRTO undos with non-SACK flowsIlpo Järvinen2018-07-011-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If SACK is not enabled and the first cumulative ACK after the RTO retransmission covers more than the retransmitted skb, a spurious FRTO undo will trigger (assuming FRTO is enabled for that RTO). The reason is that any non-retransmitted segment acknowledged will set FLAG_ORIG_SACK_ACKED in tcp_clean_rtx_queue even if there is no indication that it would have been delivered for real (the scoreboard is not kept with TCPCB_SACKED_ACKED bits in the non-SACK case so the check for that bit won't help like it does with SACK). Having FLAG_ORIG_SACK_ACKED set results in the spurious FRTO undo in tcp_process_loss. We need to use more strict condition for non-SACK case and check that none of the cumulatively ACKed segments were retransmitted to prove that progress is due to original transmissions. Only then keep FLAG_ORIG_SACK_ACKED set, allowing FRTO undo to proceed in non-SACK case. (FLAG_ORIG_SACK_ACKED is planned to be renamed to FLAG_ORIG_PROGRESS to better indicate its purpose but to keep this change minimal, it will be done in another patch). Besides burstiness and congestion control violations, this problem can result in RTO loop: When the loss recovery is prematurely undoed, only new data will be transmitted (if available) and the next retransmission can occur only after a new RTO which in case of multiple losses (that are not for consecutive packets) requires one RTO per loss to recover. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Tested-by: Neal Cardwell <ncardwell@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller2018-07-0122-294/+449
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Daniel Borkmann says: ==================== pull-request: bpf 2018-07-01 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) A bpf_fib_lookup() helper fix to change the API before freeze to return an encoding of the FIB lookup result and return the nexthop device index in the params struct (instead of device index as return code that we had before), from David. 2) Various BPF JIT fixes to address syzkaller fallout, that is, do not reject progs when set_memory_*() fails since it could still be RO. Also arm32 JIT was not using bpf_jit_binary_lock_ro() API which was an issue, and a memory leak in s390 JIT found during review, from Daniel. 3) Multiple fixes for sockmap/hash to address most of the syzkaller triggered bugs. Usage with IPv6 was crashing, a GPF in bpf_tcp_close(), a missing sock_map_release() routine to hook up to callbacks, and a fix for an omitted bucket lock in sock_close(), from John. 4) Two bpftool fixes to remove duplicated error message on program load, and another one to close the libbpf object after program load. One additional fix for nfp driver's BPF offload to avoid stopping offload completely if replace of program failed, from Jakub. 5) Couple of BPF selftest fixes that bail out in some of the test scripts if the user does not have the right privileges, from Jeffrin. 6) Fixes in test_bpf for s390 when CONFIG_BPF_JIT_ALWAYS_ON is set where we need to set the flag that some of the test cases are expected to fail, from Kleber. 7) Fix to detangle BPF_LIRC_MODE2 dependency from CONFIG_CGROUP_BPF since it has no relation to it and lirc2 users often have configs without cgroups enabled and thus would not be able to use it, from Sean. 8) Fix a selftest failure in sockmap by removing a useless setrlimit() call that would set a too low limit where at the same time we are already including bpf_rlimit.h that does the job, from Yonghong. 9) Fix BPF selftest config with missing missing NET_SCHED, from Anders. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * \ \ Merge branch 'bpf-sockmap-fixes'Daniel Borkmann2018-07-011-70/+166
| | |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | John Fastabend says: ==================== This addresses two syzbot issues that lead to identifying (by Eric and Wei) a class of bugs where we don't correctly check for IPv4/v6 sockets and their associated state. The second issue was a locking omission in sockhash. The first patch addresses IPv6 socks and fixing an error where sockhash would overwrite the prot pointer with IPv4 prot. To fix this build similar solution to TLS ULP. Although we continue to allow socks in all states not just ESTABLISH in this patch set because as Martin points out there should be no issue with this on the sockmap ULP because we don't use the ctx in this code. Once multiple ULPs coexist we may need to revisit this. However we can do this in *next trees. The other issue syzbot found that the tcp_close() handler missed locking the hash bucket lock which could result in corrupting the sockhash bucket list if delete and close ran at the same time. And also the smap_list_remove() routine was not working correctly at all. This was not caught in my testing because in general my tests (to date at least lets add some more robust selftest in bpf-next) do things in the "expected" order, create map, add socks, delete socks, then tear down maps. The tests we have that do the ops out of this order where only working on single maps not multi- maps so we never saw the issue. Thanks syzbot. The fix is to restructure the tcp_close() lock handling. And fix the obvious bug in smap_list_remove(). Finally, during review I noticed the release handler was omitted from the upstream code (patch 4) due to an incorrect merge conflict fix when I ported the code to latest bpf-next before submitting. This would leave references to the map around if the user never closes the map. v3: rework patches, dropping ESTABLISH check and adding rcu annotation along with the smap_list_remove fix v4: missed one more case where maps was being accessed without the sk_callback_lock, spoted by Martin as well. v5: changed to use a specific lock for maps and reduced callback lock so that it is only used to gaurd sk callbacks. I think this makes the logic a bit cleaner and avoids confusion ovoer what each lock is doing. Also big thanks to Martin for thorough review he caught at least one case where I missed a rcu_call(). ==================== Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | | * | | bpf: sockhash, add release routineJohn Fastabend2018-07-011-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add map_release_uref pointer to hashmap ops. This was dropped when original sockhash code was ported into bpf-next before initial commit. Fixes: 81110384441a ("bpf: sockmap, add hash map support") Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | | * | | bpf: sockhash fix omitted bucket lock in sock_closeJohn Fastabend2018-07-011-49/+96
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | First the sk_callback_lock() was being used to protect both the sock callback hooks and the psock->maps list. This got overly convoluted after the addition of sockhash (in sockmap it made some sense because masp and callbacks were tightly coupled) so lets split out a specific lock for maps and only use the callback lock for its intended purpose. This fixes a couple cases where we missed using maps lock when it was in fact needed. Also this makes it easier to follow the code because now we can put the locking closer to the actual code its serializing. Next, in sock_hash_delete_elem() the pattern was as follows, sock_hash_delete_elem() [...] spin_lock(bucket_lock) l = lookup_elem_raw() if (l) hlist_del_rcu() write_lock(sk_callback_lock) .... destroy psock ... write_unlock(sk_callback_lock) spin_unlock(bucket_lock) The ordering is necessary because we only know the {p}sock after dereferencing the hash table which we can't do unless we have the bucket lock held. Once we have the bucket lock and the psock element it is deleted from the hashmap to ensure any other path doing a lookup will fail. Finally, the refcnt is decremented and if zero the psock is destroyed. In parallel with the above (or free'ing the map) a tcp close event may trigger tcp_close(). Which at the moment omits the bucket lock altogether (oops!) where the flow looks like this, bpf_tcp_close() [...] write_lock(sk_callback_lock) for each psock->maps // list of maps this sock is part of hlist_del_rcu(ref_hash_node); .... destroy psock ... write_unlock(sk_callback_lock) Obviously, and demonstrated by syzbot, this is broken because we can have multiple threads deleting entries via hlist_del_rcu(). To fix this we might be tempted to wrap the hlist operation in a bucket lock but that would create a lock inversion problem. In summary to follow locking rules the psocks maps list needs the sk_callback_lock (after this patch maps_lock) but we need the bucket lock to do the hlist_del_rcu. To resolve the lock inversion problem pop the head of the maps list repeatedly and remove the reference until no more are left. If a delete happens in parallel from the BPF API that is OK as well because it will do a similar action, lookup the lock in the map/hash, delete it from the map/hash, and dec the refcnt. We check for this case before doing a destroy on the psock to ensure we don't have two threads tearing down a psock. The new logic is as follows, bpf_tcp_close() e = psock_map_pop(psock->maps) // done with map lock bucket_lock() // lock hash list bucket l = lookup_elem_raw(head, hash, key, key_size); if (l) { //only get here if elmnt was not already removed hlist_del_rcu() ... destroy psock... } bucket_unlock() And finally for all the above to work add missing locking around map operations per above. Then add RCU annotations and use rcu_dereference/rcu_assign_pointer to manage values relying on RCU so that the object is not free'd from sock_hash_free() while it is being referenced in bpf_tcp_close(). Reported-by: syzbot+0ce137753c78f7b6acc1@syzkaller.appspotmail.com Fixes: 81110384441a ("bpf: sockmap, add hash map support") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | | * | | bpf: sockmap, fix smap_list_map_remove when psock is in many mapsJohn Fastabend2018-07-011-12/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a hashmap is free'd with open socks it removes the reference to the hash entry from the psock. If that is the last reference to the psock then it will also be free'd by the reference counting logic. However the current logic that removes the hash reference from the list of references is broken. In smap_list_remove() we first check if the sockmap entry matches and then check if the hashmap entry matches. But, the sockmap entry sill always match because its NULL in this case which causes the first entry to be removed from the list. If this is always the "right" entry (because the user adds/removes entries in order) then everything is OK but otherwise a subsequent bpf_tcp_close() may reference a free'd object. To fix this create two list handlers one for sockmap and one for sockhash. Reported-by: syzbot+0ce137753c78f7b6acc1@syzkaller.appspotmail.com Fixes: 81110384441a ("bpf: sockmap, add hash map support") Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | | * | | bpf: sockmap, fix crash when ipv6 sock is addedJohn Fastabend2018-07-011-10/+48
| | |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This fixes a crash where we assign tcp_prot to IPv6 sockets instead of tcpv6_prot. Previously we overwrote the sk->prot field with tcp_prot even in the AF_INET6 case. This patch ensures the correct tcp_prot and tcpv6_prot are used. Tested with 'netserver -6' and 'netperf -H [IPv6]' as well as 'netperf -H [IPv4]'. The ESTABLISHED check resolves the previously crashing case here. Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support") Reported-by: syzbot+5c063698bdbfac19f363@syzkaller.appspotmail.com Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | * | | Merge branch 'bpf-fixes'Alexei Starovoitov2018-06-294-78/+11
| | |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Daniel Borkmann says: ==================== This set contains three fixes that are mostly JIT and set_memory_*() related. The third in the series in particular fixes the syzkaller bugs that were still pending; aside from local reproduction & testing, also 'syz test' wasn't able to trigger them anymore. I've tested this series on x86_64, arm64 and s390x, and kbuild bot wasn't yelling either for the rest. For details, please see patches as usual, thanks! ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| | | * | | bpf: undo prog rejection on read-only lock failureDaniel Borkmann2018-06-292-77/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Partially undo commit 9facc336876f ("bpf: reject any prog that failed read-only lock") since it caused a regression, that is, syzkaller was able to manage to cause a panic via fault injection deep in set_memory_ro() path by letting an allocation fail: In x86's __change_page_attr_set_clr() it was able to change the attributes of the primary mapping but not in the alias mapping via cpa_process_alias(), so the second, inner call to the __change_page_attr() via __change_page_attr_set_clr() had to split a larger page and failed in the alloc_pages() with the artifically triggered allocation error which is then propagated down to the call site. Thus, for set_memory_ro() this means that it returned with an error, but from debugging a probe_kernel_write() revealed EFAULT on that memory since the primary mapping succeeded to get changed. Therefore the subsequent hdr->locked = 0 reset triggered the panic as it was performed on read-only memory, so call-site assumptions were infact wrong to assume that it would either succeed /or/ not succeed at all since there's no such rollback in set_memory_*() calls from partial change of mappings, in other words, we're left in a state that is "half done". A later undo via set_memory_rw() is succeeding though due to matching permissions on that part (aka due to the try_preserve_large_page() succeeding). While reproducing locally with explicitly triggering this error, the initial splitting only happens on rare occasions and in real world it would additionally need oom conditions, but that said, it could partially fail. Therefore, it is definitely wrong to bail out on set_memory_ro() error and reject the program with the set_memory_*() semantics we have today. Shouldn't have gone the extra mile since no other user in tree today infact checks for any set_memory_*() errors, e.g. neither module_enable_ro() / module_disable_ro() for module RO/NX handling which is mostly default these days nor kprobes core with alloc_insn_page() / free_insn_page() as examples that could be invoked long after bootup and original 314beb9bcabf ("x86: bpf_jit_comp: secure bpf jit against spraying attacks") did neither when it got first introduced to BPF so "improving" with bailing out was clearly not right when set_memory_*() cannot handle it today. Kees suggested that if set_memory_*() can fail, we should annotate it with __must_check, and all callers need to deal with it gracefully given those set_memory_*() markings aren't "advisory", but they're expected to actually do what they say. This might be an option worth to move forward in future but would at the same time require that set_memory_*() calls from supporting archs are guaranteed to be "atomic" in that they provide rollback if part of the range fails, once that happened, the transition from RW -> RO could be made more robust that way, while subsequent RO -> RW transition /must/ continue guaranteeing to always succeed the undo part. Reported-by: syzbot+a4eb8c7766952a1ca872@syzkaller.appspotmail.com Reported-by: syzbot+d866d1925855328eac3b@syzkaller.appspotmail.com Fixes: 9facc336876f ("bpf: reject any prog that failed read-only lock") Cc: Laura Abbott <labbott@redhat.com> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| | | * | | bpf, s390: fix potential memleak when later bpf_jit_prog failsDaniel Borkmann2018-06-291-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we would ever fail in the bpf_jit_prog() pass that writes the actual insns to the image after we got header via bpf_jit_binary_alloc() then we also need to make sure to free it through bpf_jit_binary_free() again when bailing out. Given we had prior bpf_jit_prog() passes to initially probe for clobbered registers, program size and to fill in addrs arrray for jump targets, this is more of a theoretical one, but at least make sure this doesn't break with future changes. Fixes: 054623105728 ("s390/bpf: Add s390x eBPF JIT compiler backend") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| | | * | | bpf, arm32: fix to use bpf_jit_binary_lock_ro apiDaniel Borkmann2018-06-291-1/+1
| | |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Any eBPF JIT that where its underlying arch supports ARCH_HAS_SET_MEMORY would need to use bpf_jit_binary_{un,}lock_ro() pair instead of the set_memory_{ro,rw}() pair directly as otherwise changes to the former might break. arm32's eBPF conversion missed to change it, so fix this up here. Fixes: 39c13c204bb1 ("arm: eBPF JIT compiler") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| | * | | bpf: Change bpf_fib_lookup to return lookup statusDavid Ahern2018-06-293-41/+81
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For ACLs implemented using either FIB rules or FIB entries, the BPF program needs the FIB lookup status to be able to drop the packet. Since the bpf_fib_lookup API has not reached a released kernel yet, change the return code to contain an encoding of the FIB lookup result and return the nexthop device index in the params struct. In addition, inform the BPF program of any post FIB lookup reason as to why the packet needs to go up the stack. The fib result for unicast routes must have an egress device, so remove the check that it is non-NULL. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | * | | test_bpf: flag tests that cannot be jited on s390Kleber Sacilotto de Souza2018-06-281-0/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Flag with FLAG_EXPECTED_FAIL the BPF_MAXINSNS tests that cannot be jited on s390 because they exceed BPF_SIZE_MAX and fail when CONFIG_BPF_JIT_ALWAYS_ON is set. Also set .expected_errcode to -ENOTSUPP so the tests pass in that case. Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | * | | selftests: bpf: notification about privilege required to run ↵Jeffrin Jose T2018-06-261-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | test_lwt_seg6local.sh testing script This test needs root privilege for it's successful execution. This patch is atleast used to notify the user about the privilege the script demands for the smooth execution of the test. Signed-off-by: Jeffrin Jose T (Rajagiri SET) <ahiliation@gmail.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | * | | selftests: bpf: notification about privilege required to run ↵Jeffrin Jose T2018-06-261-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | test_lirc_mode2.sh testing script The test_lirc_mode2.sh script require root privilege for the successful execution of the test. This patch is to notify the user about the privilege the script demands for the successful execution of the test. Signed-off-by: Jeffrin Jose T (Rajagiri SET) <ahiliation@gmail.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | * | | selftests: bpf: add missing NET_SCHED to configAnders Roxell2018-06-261-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | CONFIG_NET_SCHED wasn't enabled in arm64's defconfig only for x86. So bpf/test_tunnel.sh tests fails with: RTNETLINK answers: Operation not supported RTNETLINK answers: Operation not supported We have an error talking to the kernel, -1 Enable NET_SCHED and more tests pass. Fixes: 3bce593ac06b ("selftests: bpf: config: add config fragments") Signed-off-by: Anders Roxell <anders.roxell@linaro.org> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | * | | bpf: fix attach type BPF_LIRC_MODE2 dependency wrt CONFIG_CGROUP_BPFSean Young2018-06-267-92/+132
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the kernel is compiled with CONFIG_CGROUP_BPF not enabled, it is not possible to attach, detach or query IR BPF programs to /dev/lircN devices, making them impossible to use. For embedded devices, it should be possible to use IR decoding without cgroups or CONFIG_CGROUP_BPF enabled. This change requires some refactoring, since bpf_prog_{attach,detach,query} functions are now always compiled, but their code paths for cgroups need moving out. Rather than a #ifdef CONFIG_CGROUP_BPF in kernel/bpf/syscall.c, moving them to kernel/bpf/cgroup.c and kernel/bpf/sockmap.c does not require #ifdefs since that is already conditionally compiled. Fixes: f4364dcfc86d ("media: rc: introduce BPF_PROG_LIRC_MODE2") Signed-off-by: Sean Young <sean@mess.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | * | | nfp: bpf: don't stop offload if replace failedJakub Kicinski2018-06-251-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Stopping offload completely if replace of program failed dates back to days of transparent offload. Back then we wanted to silently fall back to the in-driver processing. Today we mark programs for offload when they are loaded into the kernel, so the transparent offload is no longer a reality. Flags check in the driver will only allow replace of a driver program with another driver program or an offload program with another offload program. When driver program is replaced stopping offload is a no-op, because driver program isn't offloaded. When replacing offloaded program if the offload fails the entire operation will fail all the way back to user space and we should continue using the old program. IOW when replacing a driver program stopping offload is unnecessary and when replacing offloaded program - it's a bug, old program should continue to run. In practice this bug would mean that if offload operation was to fail (either due to FW communication error, kernel OOM or new program being offloaded but for a different netdev) driver would continue reporting that previous XDP program is offloaded but in fact no program will be loaded in hardware. The failure is fairly unlikely (found by inspection, when working on the code) but it's unpleasant. Backport note: even though the bug was introduced in commit cafa92ac2553 ("nfp: bpf: add support for XDP_FLAGS_HW_MODE"), this fix depends on commit 441a33031fe5 ("net: xdp: don't allow device-bound programs in driver mode"), so this fix is sufficient only in v4.15 or newer. Kernels v4.13.x and v4.14.x do need to stop offload if it was transparent/opportunistic, i.e. if XDP_FLAGS_HW_MODE was not set on running program. Fixes: cafa92ac2553 ("nfp: bpf: add support for XDP_FLAGS_HW_MODE") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | * | | tools/bpf: fix test_sockmap failureYonghong Song2018-06-221-6/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On one of our production test machine, when running bpf selftest test_sockmap, I got the following error: # sudo ./test_sockmap libbpf: failed to create map (name: 'sock_map'): Operation not permitted libbpf: failed to load object 'test_sockmap_kern.o' libbpf: Can't get the 0th fd from program sk_skb1: only -1 instances ...... load_bpf_file: (-1) Operation not permitted ERROR: (-1) load bpf failed The error is due to not-big-enough rlimit struct rlimit r = {10 * 1024 * 1024, RLIM_INFINITY}; The test already includes "bpf_rlimit.h", which sets current and max rlimit to RLIM_INFINITY. Let us just use it. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | * | | selftests: bpf: notification about privilege required to run test_kmod.sh ↵Jeffrin Jose T2018-06-221-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | testing script The test_kmod.sh script require root privilege for the successful execution of the test. This patch is to notify the user about the privilege the script demands for the successful execution of the test. Signed-off-by: Jeffrin Jose T (Rajagiri SET) <ahiliation@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | * | | Merge branch 'bpf-bpftool-fixes'Daniel Borkmann2018-06-211-4/+8
| | |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Jakub Kicinski says: ==================== Two small fixes for error handling in bpftool prog load, first patch removes a duplicated message, second one frees resources correctly. Multiple error messages break JSON: { "error": "can't pin the object (/sys/fs/bpf/a): File exists" },{ "error": "failed to pin program" } ==================== Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | | * | | tools: bpftool: remember to close the libbpf object after prog loadJakub Kicinski2018-06-211-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remembering to close all descriptors and free memory may not seem important in a user space tool like bpftool, but if we were to run in batch mode the consumed resources start to add up quickly. Make sure program load closes the libbpf object (which unloads and frees it). Fixes: 49a086c201a9 ("bpftool: implement prog load command") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | | * | | tools: bpftool: remove duplicated error message on prog loadJakub Kicinski2018-06-211-3/+1
| | |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | do_pin_fd() will already print out an error message if something goes wrong. Printing another error is unnecessary and will break JSON output, since error messages are full objects: $ bpftool -jp prog load tracex1_kern.o /sys/fs/bpf/a { "error": "can't pin the object (/sys/fs/bpf/a): File exists" },{ "error": "failed to pin program" } Fixes: 49a086c201a9 ("bpftool: implement prog load command") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * | | | net: fib_rules: bring back rule_exists to match rule during addRoopa Prabhu2018-06-301-1/+71
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After commit f9d4b0c1e969 ("fib_rules: move common handling of newrule delrule msgs into fib_nl2rule"), rule_exists got replaced by rule_find for existing rule lookup in both the add and del paths. While this is good for the delete path, it solves a few problems but opens up a few invalid key matches in the add path. $ip -4 rule add table main tos 10 fwmark 1 $ip -4 rule add table main tos 10 RTNETLINK answers: File exists The problem here is rule_find does not check if the key masks in the new and old rule are the same and hence ends up matching a more secific rule. Rule key masks cannot be easily compared today without an elaborate if-else block. Its best to introduce key masks for easier and accurate rule comparison in the future. Until then, due to fear of regressions this patch re-introduces older loose rule_exists during add. Also fixes both rule_exists and rule_find to cover missing attributes. Fixes: f9d4b0c1e969 ("fib_rules: move common handling of newrule delrule msgs into fib_nl2rule") Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | hv_netvsc: split sub-channel setup into async and syncStephen Hemminger2018-06-304-52/+65
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When doing device hotplug the sub channel must be async to avoid deadlock issues because device is discovered in softirq context. When doing changes to MTU and number of channels, the setup must be synchronous to avoid races such as when MTU and device settings are done in a single ip command. Reported-by: Thomas Walker <Thomas.Walker@twosigma.com> Fixes: 8195b1396ec8 ("hv_netvsc: fix deadlock on hotplug") Fixes: 732e49850c5e ("netvsc: fix race on sub channel creation") Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | net: use dev_change_tx_queue_len() for SIOCSIFTXQLENCong Wang2018-06-301-9/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As noticed by Eric, we need to switch to the helper dev_change_tx_queue_len() for SIOCSIFTXQLEN call path too, otheriwse still miss dev_qdisc_change_tx_queue_len(). Fixes: 6a643ddb5624 ("net: introduce helper dev_change_tx_queue_len()") Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | atm: zatm: Fix potential Spectre v1Gustavo A. R. Silva2018-06-301-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | pool can be indirectly controlled by user-space, hence leading to a potential exploitation of the Spectre variant 1 vulnerability. This issue was detected with the help of Smatch: drivers/atm/zatm.c:1491 zatm_ioctl() warn: potential spectre issue 'zatm_dev->pool_info' (local cap) Fix this by sanitizing pool before using it to index zatm_dev->pool_info Notice that given that speculation windows are large, the policy is to kill the speculation on the first load and not worry if it can be completed with a dependent load/store [1]. [1] https://marc.info/?l=linux-kernel&m=152449131114778&w=2 Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | Merge branch 's390-qeth-fixes'David S. Miller2018-06-304-30/+57
| |\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Julian Wiedmann says: ==================== s390/qeth: fixes 2018-06-29 please apply a few qeth fixes for -net and your 4.17 stable queue. Patches 1-3 fix several issues wrt to MAC address management that were introduced during the 4.17 cycle. Patch 4 tackles a long-standing issue with busy multi-connection workloads on devices in af_iucv mode. Patch 5 makes sure to re-enable all active HW offloads, after a card was previously set offline and thus lost its HW context. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | | s390/qeth: consistently re-enable device featuresJulian Wiedmann2018-06-304-17/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit e830baa9c3f0 ("qeth: restore device features after recovery") and commit ce3443564145 ("s390/qeth: rely on kernel for feature recovery") made sure that the HW functions for device features get re-programmed after recovery. But we missed that the same handling is also required when a card is first set offline (destroying all HW context), and then online again. Fix this by moving the re-enable action out of the recovery-only path. Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | | s390/qeth: don't clobber buffer on async TX completionJulian Wiedmann2018-06-302-6/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If qeth_qdio_output_handler() detects that a transmit requires async completion, it replaces the pending buffer's metadata object (qeth_qdio_out_buffer) so that this queue buffer can be re-used while the data is pending completion. Later when the CQ indicates async completion of such a metadata object, qeth_qdio_cq_handler() tries to free any data associated with this object (since HW has now completed the transfer). By calling qeth_clear_output_buffer(), it erronously operates on the queue buffer that _previously_ belonged to this transfer ... but which has been potentially re-used several times by now. This results in double-free's of the buffer's data, and failing transmits as the buffer descriptor is scrubbed in mid-air. The correct way of handling this situation is to 1. scrub the queue buffer when it is prepared for re-use, and 2. later obtain the data addresses from the async-completion notifier (ie. the AOB), instead of the queue buffer. All this only affects qeth devices used for af_iucv HiperTransport. Fixes: 0da9581ddb0f ("qeth: exploit asynchronous delivery of storage blocks") Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | | s390/qeth: avoid using is_multicast_ether_addr_64bits on (u8 *)[6]Vasily Gorbik2018-06-301-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | *ether_addr*_64bits functions have been introduced to optimize performance critical paths, which access 6-byte ethernet address as u64 value to get "nice" assembly. A harmless hack works nicely on ethernet addresses shoved into a structure or a larger buffer, until busted by Kasan on smth like plain (u8 *)[6]. qeth_l2_set_mac_address calls qeth_l2_remove_mac passing u8 old_addr[ETH_ALEN] as an argument. Adding/removing macs for an ethernet adapter is not that performance critical. Moreover is_multicast_ether_addr_64bits itself on s390 is not faster than is_multicast_ether_addr: is_multicast_ether_addr(%r2) -> %r2 llc %r2,0(%r2) risbg %r2,%r2,63,191,0 is_multicast_ether_addr_64bits(%r2) -> %r2 llgc %r2,0(%r2) risbg %r2,%r2,63,191,0 So, let's just use is_multicast_ether_addr instead of is_multicast_ether_addr_64bits. Fixes: bcacfcbc82b4 ("s390/qeth: fix MAC address update sequence") Reviewed-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | | s390/qeth: fix race when setting MAC addressJulian Wiedmann2018-06-301-4/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When qeth_l2_set_mac_address() finds the card in a non-reachable state, it merely copies the new MAC address into dev->dev_addr so that __qeth_l2_set_online() can later register it with the HW. But __qeth_l2_set_online() may very well be running concurrently, so we can't trust the card state without appropriate locking: If the online sequence is past the point where it registers dev->dev_addr (but not yet in SOFTSETUP state), any address change needs to be properly programmed into the HW. Otherwise the netdevice ends up with a different MAC address than what's set in the HW, and inbound traffic is not forwarded as expected. This is most likely to occur for OSD in LPAR, where commit 21b1702af12e ("s390/qeth: improve fallback to random MAC address") now triggers eg. systemd to immediately change the MAC when the netdevice is registered with a NET_ADDR_RANDOM address. Fixes: bcacfcbc82b4 ("s390/qeth: fix MAC address update sequence") Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>