summaryrefslogtreecommitdiffstats
path: root/include (follow)
Commit message (Collapse)AuthorAgeFilesLines
* NTB: Add PCIe Gen4 link speedSerge Semin2017-07-061-0/+2
| | | | | | Signed-off-by: Serge Semin <fancer.lancer@gmail.com> Acked-by: Allen Hubbe <Allen.Hubbe@dell.com> Signed-off-by: Jon Mason <jdmason@kudzu.us>
* NTB: Add Messaging NTB APISerge Semin2017-07-061-0/+205
| | | | | | | | | | | | | | | | | | | | | | | | | | Some IDT NTB-capable PCIe-switches have message registers to communicate with peer devices. This patch adds new NTB API callback methods, which can be used to utilize these registers functionality: ntb_msg_count(); - get number of message registers ntb_msg_inbits(); - get bitfield of inbound message registers status ntb_msg_outbits(); - get bitfield of outbound message registers status ntb_msg_read_sts(); - read the inbound and outbound message registers status ntb_msg_clear_sts(); - clear status bits of message registers ntb_msg_set_mask(); - mask interrupts raised by status bits of message registers. ntb_msg_clear_mask(); - clear interrupts mask bits of message registers ntb_msg_read(midx, *pidx); - read message register with specified index, additionally getting peer port index which data received from ntb_msg_write(midx, pidx); - write data to the specified message register sending it to the passed peer device connected over a pidx port ntb_msg_event(); - notify driver context of a new message event Of course there is hardware which doesn't support Message registers, so this API is made optional. Signed-off-by: Serge Semin <fancer.lancer@gmail.com> Acked-by: Allen Hubbe <Allen.Hubbe@dell.com> Signed-off-by: Jon Mason <jdmason@kudzu.us>
* NTB: Alter Scratchpads API to support multi-ports devicesSerge Semin2017-07-061-28/+45
| | | | | | | | | | | | | | | | | | | | | | | | | | Even though there is no any real NTB hardware, which would have both more than two ports and Scratchpad registers, it is logically correct to have Scratchpad API accepting a peer port index as well. Intel/AMD drivers utilize Primary and Secondary topology to split Scratchpad between connected root devices. Since port-index API introduced, Intel/AMD NTB hardware drivers can use device port to determine which Scratchpad registers actually belong to local and peer devices. The same approach can be used if some potential hardware in future will be multi-port and have some set of Scratchpads. Here are the brief of changes in the API: ntb_spad_count() - return number of Scratchpads per each port ntb_peer_spad_addr(pidx, sidx) - address of Scratchpad register of the peer device with pidx-index ntb_peer_spad_read(pidx, sidx) - read specified Scratchpad register of the peer with pidx-index ntb_peer_spad_write(pidx, sidx) - write data to Scratchpad register of the peer with pidx-index Since there is hardware which doesn't support Scratchpad registers, the corresponding API methods are now made optional. Signed-off-by: Serge Semin <fancer.lancer@gmail.com> Acked-by: Allen Hubbe <Allen.Hubbe@dell.com> Signed-off-by: Jon Mason <jdmason@kudzu.us>
* NTB: Alter MW API to support multi-ports devicesSerge Semin2017-07-061-45/+163
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Multi-port NTB devices permit to share a memory between all accessible peers. Memory Windows API is altered to correspondingly initialize and map memory windows for such devices: ntb_mw_count(pidx); - number of inbound memory windows, which can be allocated for shared buffer with specified peer device. ntb_mw_get_align(pidx, widx); - get alignment and size restriction parameters to properly allocate inbound memory region. ntb_peer_mw_count(); - get number of outbound memory windows. ntb_peer_mw_get_addr(widx); - get mapping address of an outbound memory window If hardware supports inbound translation configured on the local ntb port: ntb_mw_set_trans(pidx, widx); - set translation address of allocated inbound memory window so a peer device could access it. ntb_mw_clear_trans(pidx, widx); - clear the translation address of an inbound memory window. If hardware supports outbound translation configured on the peer ntb port: ntb_peer_mw_set_trans(pidx, widx); - set translation address of a memory window retrieved from a peer device ntb_peer_mw_clear_trans(pidx, widx); - clear the translation address of an outbound memory window Signed-off-by: Serge Semin <fancer.lancer@gmail.com> Acked-by: Allen Hubbe <Allen.Hubbe@dell.com> Signed-off-by: Jon Mason <jdmason@kudzu.us>
* NTB: Alter link-state API to support multi-port devicesSerge Semin2017-07-061-15/+16
| | | | | | | | | | | Multi-port devices permit the NTB connections between multiple domains, so a local device can have NTB link being up with one peer and being down with another. NTB link-state API is appropriately altered to return a bitfield of the link-states between the local device and possible peers. Signed-off-by: Serge Semin <fancer.lancer@gmail.com> Acked-by: Allen Hubbe <Allen.Hubbe@dell.com> Signed-off-by: Jon Mason <jdmason@kudzu.us>
* NTB: Add indexed ports NTB APISerge Semin2017-07-061-0/+156
| | | | | | | | | | | | | | | | | | | | | There is some NTB hardware, which can combine more than just two domains over NTB. For instance, some IDT PCIe-switches can have NTB-functions activated on more than two-ports. The different domains are distinguished by ports they are connected to. So the new port-related methods are added to the NTB API: ntb_port_number() - return local port ntb_peer_port_count() - return number of peers local port can connect to ntb_peer_port_number(pdix) - return port number by it index ntb_peer_port_idx(port) - return port index by it number Current test-drivers aren't changed much. They still support two-ports devices for the time being while multi-ports hardware drivers aren't added. By default port-related API is declared for two-ports hardware. So corresponding hardware drivers won't need to implement it. Signed-off-by: Serge Semin <fancer.lancer@gmail.com> Signed-off-by: Jon Mason <jdmason@kudzu.us>
* NTB: Make link-state API being declared firstSerge Semin2017-07-061-68/+69
| | | | | | | | | Since link operations are usually performed before memory window access operations, it's logically better to declare link-related API before any of MW/Doorbell/Scratchpad methods. Signed-off-by: Serge Semin <fancer.lancer@gmail.com> Signed-off-by: Jon Mason <jdmason@kudzu.us>
* moduleparam: fix doc: hwparam_irq configures an IRQSylvain 'ythier' Hitier2017-07-031-1/+1
| | | | | Signed-off-by: Sylvain 'ythier' Hitier <sylvain.hitier@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* uapi/linux/a.out.h: don't use deprecated system-specific predefines.Zack Weinberg2017-06-301-25/+1
| | | | | | | | | | | | | | | | | | | | | | uapi/linux/a.out.h uses a number of predefined macros that are deprecated because they're in the application namespace (e.g. '#ifdef linux' instead of '#ifdef __linux__'). This patch either corrects or just removes them if they are not applicable to Linux. The primary reason this is worth bothering to fix, considering how obsolete a.out binary support is, is that the GCC build process considers this such a severe error that it will copy the header into a private directory and change the macro names, which causes future updates to the header to be masked. This header probably doesn't get updated very often anymore, but it is the _only_ uapi header that gets this treatment, so IMHO it is worth patching just to drive that number all the way to zero. Signed-off-by: Zack Weinberg <zackw@panix.com> [hch: removed dead conditionals] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hashtable: remove repeated phrase from a commentJakub Kicinski2017-06-301-1/+0
| | | | | | | "in a rcu enabled hashtable" is repeated twice in a comment. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2017-06-291-5/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking fixes from David Miller: 1) Need to access netdev->num_rx_queues behind an accessor in netvsc driver otherwise the build breaks with some configs, from Arnd Bergmann. 2) Add dummy xfrm_dev_event() so that build doesn't fail when CONFIG_XFRM_OFFLOAD is not set. From Hangbin Liu. 3) Don't OOPS when pfkey_msg2xfrm_state() signals an erros, from Dan Carpenter. 4) Fix MCDI command size for filter operations in sfc driver, from Martin Habets. 5) Fix UFO segmenting so that we don't calculate incorrect checksums, from Michal Kubecek. 6) When ipv6 datagram connects fail, reset destination address and port. From Wei Wang. 7) TCP disconnect must reset the cached receive DST, from WANG Cong. 8) Fix sign extension bug on 32-bit in dev_get_stats(), from Eric Dumazet. 9) fman driver has to depend on HAS_DMA, from Madalin Bucur. 10) Fix bpf pointer leak with xadd in verifier, from Daniel Borkmann. 11) Fix negative page counts with GFO, from Michal Kubecek. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (41 commits) sfc: fix attempt to translate invalid filter ID net: handle NAPI_GRO_FREE_STOLEN_HEAD case also in napi_frags_finish() bpf: prevent leaking pointer via xadd on unpriviledged arcnet: com20020-pci: add missing pdev setup in netdev structure arcnet: com20020-pci: fix dev_id calculation arcnet: com20020: remove needless base_addr assignment Trivial fix to spelling mistake in arc_printk message arcnet: change irq handler to lock irqsave rocker: move dereference before free mlxsw: spectrum_router: Fix NULL pointer dereference net: sched: Fix one possible panic when no destroy callback virtio-net: serialize tx routine during reset net: usb: asix88179_178a: Add support for the Belkin B2B128 fsl/fman: add dependency on HAS_DMA net: prevent sign extension in dev_get_stats() tcp: reset sk_rx_dst in tcp_disconnect() net: ipv6: reset daddr and dport in sk if connect() fails bnx2x: Don't log mc removal needlessly bnxt_en: Fix netpoll handling. bnxt_en: Add missing logic to handle TPA end error conditions. ...
| * Merge branch 'master' of ↵David S. Miller2017-06-231-5/+2
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2017-06-23 1) Fix xfrm garbage collecting when unregistering a netdevice. From Hangbin Liu. 2) Fix NULL pointer derefernce when exiting a network namespace. From Hangbin Liu. 3) Fix some error codes in pfkey to prevent a NULL pointer derefernce. From Dan Carpenter. 4) Fix NULL pointer derefernce on allocation failure in pfkey. From Dan Carpenter. 5) Adjust IPv6 payload_len to include extension headers. Otherwise we corrupt the packets when doing ESP GRO on transport mode. From Yossi Kuperman. 6) Set nhoff to the proper offset of the IPv6 nexthdr when doing ESP GRO. From Yossi Kuperman. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * xfrm: fix xfrm_dev_event() missing when compile without CONFIG_XFRM_OFFLOADHangbin Liu2017-06-071-5/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In commit d77e38e612a0 ("xfrm: Add an IPsec hardware offloading API") we make xfrm_device.o only compiled when enable option CONFIG_XFRM_OFFLOAD. But this will make xfrm_dev_event() missing if we only enable default XFRM options. Then if we set down and unregister an interface with IPsec on it. there will no xfrm_garbage_collect(), which will cause dev usage count hold and get error like: unregister_netdevice: waiting for <dev> to become free. Usage count = 4 Fixes: d77e38e612a0 ("xfrm: Add an IPsec hardware offloading API") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
* | | block: provide bio_uninit() free freeing integrity/task associationsJens Axboe2017-06-281-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Wen reports significant memory leaks with DIF and O_DIRECT: "With nvme devive + T10 enabled, On a system it has 256GB and started logging /proc/meminfo & /proc/slabinfo for every minute and in an hour it increased by 15968128 kB or ~15+GB.. Approximately 256 MB / minute leaking. /proc/meminfo | grep SUnreclaim... SUnreclaim: 6752128 kB SUnreclaim: 6874880 kB SUnreclaim: 7238080 kB .... SUnreclaim: 22307264 kB SUnreclaim: 22485888 kB SUnreclaim: 22720256 kB When testcases with T10 enabled call into __blkdev_direct_IO_simple, code doesn't free memory allocated by bio_integrity_alloc. The patch fixes the issue. HTX has been run with +60 hours without failure." Since __blkdev_direct_IO_simple() allocates the bio on the stack, it doesn't go through the regular bio free. This means that any ancillary data allocated with the bio through the stack is not freed. Hence, we can leak the integrity data associated with the bio, if the device is using DIF/DIX. Fix this by providing a bio_uninit() and export it, so that we can use it to free this data. Note that this is a minimal fix for this issue. Any current user of bio's that are allocated outside of bio_alloc_bioset() suffers from this issue, most notably some drivers. We will fix those in a more comprehensive patch for 4.13. This also means that the commit marked as being fixed by this isn't the real culprit, it's just the most obvious one out there. Fixes: 542ff7bf18c6 ("block: new direct I/O implementation") Reported-by: Wen Xiong <wenxiong@linux.vnet.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds2017-06-251-3/+2
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Thomas Gleixner: "A few fixes for timekeeping and timers: - Plug a subtle race due to a missing READ_ONCE() in the timekeeping code where reloading of a pointer results in an inconsistent callback argument being supplied to the clocksource->read function. - Correct the CLOCK_MONOTONIC_RAW sub-nanosecond accounting in the time keeping core code, to prevent a possible discontuity. - Apply a similar fix to the arm64 vdso clock_gettime() implementation - Add missing includes to clocksource drivers, which relied on indirect includes which fails in certain configs. - Use the proper iomem pointer for read/iounmap in a probe function" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: arm64/vdso: Fix nsec handling for CLOCK_MONOTONIC_RAW time: Fix CLOCK_MONOTONIC_RAW sub-nanosecond accounting time: Fix clock->read(clock) race around clocksource changes clocksource: Explicitly include linux/clocksource.h when needed clocksource/drivers/arm_arch_timer: Fix read and iounmap of incorrect variable
| * | | time: Fix CLOCK_MONOTONIC_RAW sub-nanosecond accountingJohn Stultz2017-06-201-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Due to how the MONOTONIC_RAW accumulation logic was handled, there is the potential for a 1ns discontinuity when we do accumulations. This small discontinuity has for the most part gone un-noticed, but since ARM64 enabled CLOCK_MONOTONIC_RAW in their vDSO clock_gettime implementation, we've seen failures with the inconsistency-check test in kselftest. This patch addresses the issue by using the same sub-ns accumulation handling that CLOCK_MONOTONIC uses, which avoids the issue for in-kernel users. Since the ARM64 vDSO implementation has its own clock_gettime calculation logic, this patch reduces the frequency of errors, but failures are still seen. The ARM64 vDSO will need to be updated to include the sub-nanosecond xtime_nsec values in its calculation for this issue to be completely fixed. Signed-off-by: John Stultz <john.stultz@linaro.org> Tested-by: Daniel Mentz <danielmentz@google.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Stephen Boyd <stephen.boyd@linaro.org> Cc: Will Deacon <will.deacon@arm.com> Cc: "stable #4 . 8+" <stable@vger.kernel.org> Cc: Miroslav Lichvar <mlichvar@redhat.com> Link: http://lkml.kernel.org/r/1496965462-20003-3-git-send-email-john.stultz@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | time: Fix clock->read(clock) race around clocksource changesJohn Stultz2017-06-201-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In tests, which excercise switching of clocksources, a NULL pointer dereference can be observed on AMR64 platforms in the clocksource read() function: u64 clocksource_mmio_readl_down(struct clocksource *c) { return ~(u64)readl_relaxed(to_mmio_clksrc(c)->reg) & c->mask; } This is called from the core timekeeping code via: cycle_now = tkr->read(tkr->clock); tkr->read is the cached tkr->clock->read() function pointer. When the clocksource is changed then tkr->clock and tkr->read are updated sequentially. The code above results in a sequential load operation of tkr->read and tkr->clock as well. If the store to tkr->clock hits between the loads of tkr->read and tkr->clock, then the old read() function is called with the new clock pointer. As a consequence the read() function dereferences a different data structure and the resulting 'reg' pointer can point anywhere including NULL. This problem was introduced when the timekeeping code was switched over to use struct tk_read_base. Before that, it was theoretically possible as well when the compiler decided to reload clock in the code sequence: now = tk->clock->read(tk->clock); Add a helper function which avoids the issue by reading tk_read_base->clock once into a local variable clk and then issue the read function via clk->read(clk). This guarantees that the read() function always gets the proper clocksource pointer handed in. Since there is now no use for the tkr.read pointer, this patch also removes it, and to address stopping the fast timekeeper during suspend/resume, it introduces a dummy clocksource to use rather then just a dummy read function. Signed-off-by: John Stultz <john.stultz@linaro.org> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Stephen Boyd <stephen.boyd@linaro.org> Cc: stable <stable@vger.kernel.org> Cc: Miroslav Lichvar <mlichvar@redhat.com> Cc: Daniel Mentz <danielmentz@google.com> Link: http://lkml.kernel.org/r/1496965462-20003-2-git-send-email-john.stultz@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* | | | Merge tag 'acpi-4.12-rc7' of ↵Linus Torvalds2017-06-241-1/+2
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI fix from Rafael Wysocki: "This fixes the ACPI-based enumeration of some I2C and SPI devices broken in 4.11. Specifics: - I2C and SPI devices are expected to be enumerated by the I2C and SPI subsystems, respectively, but due to a change made during the 4.11 cycle, in some cases the ACPI core marks them as already enumerated which causes the I2C and SPI subsystems to overlook them, so fix that (Jarkko Nikula)" * tag 'acpi-4.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: ACPI / scan: Fix enumeration for special SPI and I2C devices
| * | | | ACPI / scan: Fix enumeration for special SPI and I2C devicesJarkko Nikula2017-06-211-1/+2
| |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit f406270bf73d ("ACPI / scan: Set the visited flag for all enumerated devices") caused that two group of special SPI or I2C devices do not enumerate. SPI and I2C devices are expected to be enumerated by the SPI and I2C subsystems but change caused that acpi_bus_attach() marks those devices with acpi_device_set_enumerated(). First group of devices are matched using Device Tree compatible property with special _HID "PRP0001". Those devices have matched scan handler, acpi_scan_attach_handler() retuns 1 and acpi_bus_attach() marks them with acpi_device_set_enumerated(). Second group of devices without valid _HID such as "LNXVIDEO" have device->pnp.type.platform_id set to zero and change again marks them with acpi_device_set_enumerated(). Fix this by flagging the SPI and I2C devices during struct acpi_device object initialization time and let the code in acpi_bus_attach() to go through the device_attach() and acpi_default_enumeration() path for all SPI and I2C devices. Fixes: f406270bf73d (ACPI / scan: Set the visited flag for all enumerated devices) Signed-off-by: Jarkko Nikula <jarkko.nikula@linux.intel.com> Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> Cc: 4.11+ <stable@vger.kernel.org> # 4.11+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* | | | slub: make sysfs file removal asynchronousTejun Heo2017-06-241-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit bf5eb3de3847 ("slub: separate out sysfs_slab_release() from sysfs_slab_remove()") made slub sysfs file removals synchronous to kmem_cache shutdown. Unfortunately, this created a possible ABBA deadlock between slab_mutex and sysfs draining mechanism triggering the following lockdep warning. ====================================================== [ INFO: possible circular locking dependency detected ] 4.10.0-test+ #48 Not tainted ------------------------------------------------------- rmmod/1211 is trying to acquire lock: (s_active#120){++++.+}, at: [<ffffffff81308073>] kernfs_remove+0x23/0x40 but task is already holding lock: (slab_mutex){+.+.+.}, at: [<ffffffff8120f691>] kmem_cache_destroy+0x41/0x2d0 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (slab_mutex){+.+.+.}: lock_acquire+0xf6/0x1f0 __mutex_lock+0x75/0x950 mutex_lock_nested+0x1b/0x20 slab_attr_store+0x75/0xd0 sysfs_kf_write+0x45/0x60 kernfs_fop_write+0x13c/0x1c0 __vfs_write+0x28/0x120 vfs_write+0xc8/0x1e0 SyS_write+0x49/0xa0 entry_SYSCALL_64_fastpath+0x1f/0xc2 -> #0 (s_active#120){++++.+}: __lock_acquire+0x10ed/0x1260 lock_acquire+0xf6/0x1f0 __kernfs_remove+0x254/0x320 kernfs_remove+0x23/0x40 sysfs_remove_dir+0x51/0x80 kobject_del+0x18/0x50 __kmem_cache_shutdown+0x3e6/0x460 kmem_cache_destroy+0x1fb/0x2d0 kvm_exit+0x2d/0x80 [kvm] vmx_exit+0x19/0xa1b [kvm_intel] SyS_delete_module+0x198/0x1f0 entry_SYSCALL_64_fastpath+0x1f/0xc2 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(slab_mutex); lock(s_active#120); lock(slab_mutex); lock(s_active#120); *** DEADLOCK *** 2 locks held by rmmod/1211: #0: (cpu_hotplug.dep_map){++++++}, at: [<ffffffff810a7877>] get_online_cpus+0x37/0x80 #1: (slab_mutex){+.+.+.}, at: [<ffffffff8120f691>] kmem_cache_destroy+0x41/0x2d0 stack backtrace: CPU: 3 PID: 1211 Comm: rmmod Not tainted 4.10.0-test+ #48 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012 Call Trace: print_circular_bug+0x1be/0x210 __lock_acquire+0x10ed/0x1260 lock_acquire+0xf6/0x1f0 __kernfs_remove+0x254/0x320 kernfs_remove+0x23/0x40 sysfs_remove_dir+0x51/0x80 kobject_del+0x18/0x50 __kmem_cache_shutdown+0x3e6/0x460 kmem_cache_destroy+0x1fb/0x2d0 kvm_exit+0x2d/0x80 [kvm] vmx_exit+0x19/0xa1b [kvm_intel] SyS_delete_module+0x198/0x1f0 ? SyS_delete_module+0x5/0x1f0 entry_SYSCALL_64_fastpath+0x1f/0xc2 It'd be the cleanest to deal with the issue by removing sysfs files without holding slab_mutex before the rest of shutdown; however, given the current code structure, it is pretty difficult to do so. This patch punts sysfs file removal to a work item. Before commit bf5eb3de3847, the removal was punted to a RCU delayed work item which is executed after release. Now, we're punting to a different work item on shutdown which still maintains the goal removing the sysfs files earlier when destroying kmem_caches. Link: http://lkml.kernel.org/r/20170620204512.GI21326@htj.duckdns.org Fixes: bf5eb3de3847 ("slub: separate out sysfs_slab_release() from sysfs_slab_remove()") Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds2017-06-221-0/+2
|\ \ \ \ | |_|/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block fixes from Jens Axboe: "This contains a set of fixes for xen-blkback by way of Konrad, and a performance regression fix for blk-mq for shared tags. The latter could account for as much as a 50x reduction in performance, with the test case from the user with 500 name spaces. A more realistic setup on my end with 32 drives showed a 3.5x drop. The fix has been thoroughly tested before being committed" * 'for-linus' of git://git.kernel.dk/linux-block: blk-mq: fix performance regression with shared tags xen-blkback: don't leak stack data via response ring xen/blkback: don't use xen_blkif_get() in xen-blkback kthread xen/blkback: don't free be structure too early xen/blkback: fix disconnect while I/Os in flight
| * | | blk-mq: fix performance regression with shared tagsJens Axboe2017-06-211-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we have shared tags enabled, then every IO completion will trigger a full loop of every queue belonging to a tag set, and every hardware queue for each of those queues, even if nothing needs to be done. This causes a massive performance regression if you have a lot of shared devices. Instead of doing this huge full scan on every IO, add an atomic counter to the main queue that tracks how many hardware queues have been marked as needing a restart. With that, we can avoid looking for restartable queues, if we don't have to. Max reports that this restores performance. Before this patch, 4K IOPS was limited to 22-23K IOPS. With the patch, we are running at 950-970K IOPS. Fixes: 6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared") Reported-by: Max Gurtovoy <maxg@mellanox.com> Tested-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Tested-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2017-06-211-2/+2
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking fixes from David Miller: 1) Fix refcounting wrt timers which hold onto inet6 address objects, from Xin Long. 2) Fix an ancient bug in wireless wext ioctls, from Johannes Berg. 3) Firmware handling fixes in brcm80211 driver, from Arend Van Spriel. 4) Several mlx5 driver fixes (firmware readiness, timestamp cap reporting, devlink command validity checking, tc offloading, etc.) From Eli Cohen, Maor Dickman, Chris Mi, and Or Gerlitz. 5) Fix dst leak in IP/IP6 tunnels, from Haishuang Yan. 6) Fix dst refcount bug in decnet, from Wei Wang. 7) Netdev can be double freed in register_vlan_device(). Fix from Gao Feng. 8) Don't allow object to be destroyed while it is being dumped in SCTP, from Xin Long. 9) Fix dpaa_eth build when modular, from Madalin Bucur. 10) Fix throw route leaks, from Serhey Popovych. 11) IFLA_GROUP missing from if_nlmsg_size() and ifla_policy[] table, also from Serhey Popovych. 12) Fix premature TX SKB free in stmmac, from Niklas Cassel. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (36 commits) igmp: add a missing spin_lock_init() net: stmmac: free an skb first when there are no longer any descriptors using it sfc: remove duplicate up_write on VF filter_sem rtnetlink: add IFLA_GROUP to ifla_policy ipv6: Do not leak throw route references dt-bindings: net: sms911x: Add missing optional VDD regulators dpaa_eth: reuse the dma_ops provided by the FMan MAC device fsl/fman: propagate dma_ops net/core: remove explicit do_softirq() from busy_poll_stop() fib_rules: Resolve goto rules target on delete sctp: ensure ep is not destroyed before doing the dump net/hns:bugfix of ethtool -t phy self_test net: 8021q: Fix one possible panic caused by BUG_ON in free_netdev cxgb4: notify uP to route ctrlq compl to rdma rspq ip6_tunnel: Correct tos value in collect_md mode decnet: always not take dst->__refcnt when inserting dst into hash table ip6_tunnel: fix potential issue in __ip6_tnl_rcv ip_tunnel: fix potential issue in ip_tunnel_rcv brcmfmac: fix uninitialized warning in brcmf_usb_probe_phase2() net/mlx5e: Avoid doing a cleanup call if the profile doesn't have it ...
| * \ \ \ Merge tag 'mac80211-for-davem-2017-06-16' of ↵David S. Miller2017-06-191-2/+2
| |\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== Here's just the fix for that ancient bug: * remove wext calling ndo_do_ioctl, since nobody needs that now and it makes the type change easier * use struct iwreq instead of struct ifreq almost everywhere in wireless extensions code * copy only struct iwreq from userspace in dev_ioctl for the wireless extensions, since it's smaller than struct ifreq ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | | dev_ioctl: copy only the smaller struct iwreq for wextJohannes Berg2017-06-141-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Unfortunately, struct iwreq isn't a proper subset of struct ifreq, but is still handled by the same code path. Robert reported that then applications may (randomly) fault if the struct iwreq they pass happens to land within 8 bytes of the end of a mapping (the struct is only 32 bytes, vs. struct ifreq's 40 bytes). To fix this, pull out the code handling wireless extension ioctls and copy only the smaller structure in this case. This bug goes back a long time, I tracked that it was introduced into mainline in 2.1.15, over 20 years ago! This fixes https://bugzilla.kernel.org/show_bug.cgi?id=195869 Reported-by: Robert O'Callahan <robert@ocallahan.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
* | | | | | Merge tag 'clk-fixes-for-linus' of ↵Linus Torvalds2017-06-202-0/+4
|\ \ \ \ \ \ | |_|_|_|/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux Pull clk fixes from Stephen Boyd: "One build fix for an Amlogic clk driver and a handful of Allwinner clk driver fixes for some DT bindings and a randconfig build error that all came in this merge window" * tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux: clk: sunxi-ng: a64: Export PLL_PERIPH0 clock for the PRCM clk: sunxi-ng: h3: Export PLL_PERIPH0 clock for the PRCM dt-bindings: clock: sunxi-ccu: Add pll-periph to PRCM's needed clocks clk: sunxi-ng: sun5i: Fix ahb_bist_clk definition clk: sunxi-ng: enable SUNXI_CCU_MP for PRCM clk: meson: gxbb: fix build error without RESET_CONTROLLER clk: sunxi-ng: v3s: Fix usb otg device reset bit clk: sunxi-ng: a31: Correct lcd1-ch1 clock register offset
| * | | | | clk: sunxi-ng: a64: Export PLL_PERIPH0 clock for the PRCMChen-Yu Tsai2017-05-311-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The PRCM takes PLL_PERIPH0 as one of its parents for the AR100 clock. As such we need to be able to describe this relationship in the device tree. Export the PLL_PERIPH0 clock so we can reference it in the PRCM node. Signed-off-by: Chen-Yu Tsai <wens@csie.org> Signed-off-by: Maxime Ripard <maxime.ripard@free-electrons.com>
| * | | | | clk: sunxi-ng: h3: Export PLL_PERIPH0 clock for the PRCMChen-Yu Tsai2017-05-311-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The PRCM takes PLL_PERIPH0 as one of its parents for the AR100 clock. As such we need to be able to describe this relationship in the device tree. Export the PLL_PERIPH0 clock so we can reference it in the PRCM node. Signed-off-by: Chen-Yu Tsai <wens@csie.org> Signed-off-by: Maxime Ripard <maxime.ripard@free-electrons.com>
* | | | | | mm: larger stack guard gap, between vmasHugh Dickins2017-06-191-28/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Stack guard page is a useful feature to reduce a risk of stack smashing into a different mapping. We have been using a single page gap which is sufficient to prevent having stack adjacent to a different mapping. But this seems to be insufficient in the light of the stack usage in userspace. E.g. glibc uses as large as 64kB alloca() in many commonly used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX] which is 256kB or stack strings with MAX_ARG_STRLEN. This will become especially dangerous for suid binaries and the default no limit for the stack size limit because those applications can be tricked to consume a large portion of the stack and a single glibc call could jump over the guard page. These attacks are not theoretical, unfortunatelly. Make those attacks less probable by increasing the stack guard gap to 1MB (on systems with 4k pages; but make it depend on the page size because systems with larger base pages might cap stack allocations in the PAGE_SIZE units) which should cover larger alloca() and VLA stack allocations. It is obviously not a full fix because the problem is somehow inherent, but it should reduce attack space a lot. One could argue that the gap size should be configurable from userspace, but that can be done later when somebody finds that the new 1MB is wrong for some special case applications. For now, add a kernel command line option (stack_guard_gap) to specify the stack gap size (in page units). Implementation wise, first delete all the old code for stack guard page: because although we could get away with accounting one extra page in a stack vma, accounting a larger gap can break userspace - case in point, a program run with "ulimit -S -v 20000" failed when the 1MB gap was counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK and strict non-overcommit mode. Instead of keeping gap inside the stack vma, maintain the stack guard gap as a gap between vmas: using vm_start_gap() in place of vm_start (or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few places which need to respect the gap - mainly arch_get_unmapped_area(), and and the vma tree's subtree_gap support for that. Original-patch-by: Oleg Nesterov <oleg@redhat.com> Original-patch-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Tested-by: Helge Deller <deller@gmx.de> # parisc Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | | Merge tag 'configfs-for-4.12' of git://git.infradead.org/users/hch/configfsLinus Torvalds2017-06-161-1/+2
|\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull configfs updates from Christoph Hellwig: "A fix from Nic for a race seen in production (including a stable tag). And while I'm sending you this I'm also sneaking in a trivial new helper from Bart so that we don't need inter-tree dependencies for the next merge window" * tag 'configfs-for-4.12' of git://git.infradead.org/users/hch/configfs: configfs: Introduce config_item_get_unless_zero() configfs: Fix race between create_link and configfs_rmdir
| * | | | | | configfs: Introduce config_item_get_unless_zero()Bart Van Assche2017-06-121-1/+2
| | |_|_|_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> [hch: minor style tweak] Signed-off-by: Christoph Hellwig <hch@lst.de>
* | | | | | Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds2017-06-161-0/+2
|\ \ \ \ \ \ | | |_|_|_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block layer fix from Jens Axboe: "Just a single fix this week, fixing a regression introduced in this release. When we put the final reference to the queue, we may need to block. Ensure that we can safely do so. From Bart" * 'for-linus' of git://git.kernel.dk/linux-block: block: Fix a blk_exit_rl() regression
| * | | | | block: Fix a blk_exit_rl() regressionBart Van Assche2017-06-141-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Avoid that the following complaint is reported: BUG: sleeping function called from invalid context at kernel/workqueue.c:2790 in_atomic(): 1, irqs_disabled(): 0, pid: 41, name: rcuop/3 1 lock held by rcuop/3/41: #0: (rcu_callback){......}, at: [<ffffffff8111f9a2>] rcu_nocb_kthread+0x282/0x500 Call Trace: dump_stack+0x86/0xcf ___might_sleep+0x174/0x260 __might_sleep+0x4a/0x80 flush_work+0x7e/0x2e0 __cancel_work_timer+0x143/0x1c0 cancel_work_sync+0x10/0x20 blk_throtl_exit+0x25/0x60 blkcg_exit_queue+0x35/0x40 blk_release_queue+0x42/0x130 kobject_put+0xa9/0x190 This happens since we invoke callbacks that need to block from the queue release handler. Fix this by pushing the final release to a workqueue. Reported-by: Ross Zwisler <zwisler@gmail.com> Fixes: commit b425e5049258 ("block: Avoid that blk_exit_rl() triggers a use-after-free") Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com> Updated changelog Signed-off-by: Jens Axboe <axboe@fb.com>
* | | | | | Merge branch 'dmi-for-linus' of ↵Linus Torvalds2017-06-161-1/+1
|\ \ \ \ \ \ | |_|_|_|/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging Pull dmi fixes from Jean Delvare. * 'dmi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging: firmware: dmi_scan: Check DMI structure length firmware: dmi: Fix permissions of product_family firmware: dmi_scan: Make dmi_walk and dmi_walk_early return real error codes firmware: dmi_scan: Look for SMBIOS 3 entry point first
| * | | | | firmware: dmi_scan: Make dmi_walk and dmi_walk_early return real error codesAndy Lutomirski2017-06-151-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently they return -1 on error, which will confuse callers if they try to interpret it as a normal negative error code. Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Darren Hart (VMware) <dvhart@infradead.org> Signed-off-by: Jean Delvare <jdelvare@suse.de>
* | | | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2017-06-153-7/+15
|\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking fixes from David Miller: 1) The netlink attribute passed in to dev_set_alias() is not necessarily NULL terminated, don't use strlcpy() on it. From Alexander Potapenko. 2) Fix implementation of atomics in arm64 bpf JIT, from Daniel Borkmann. 3) Correct the release of netdevs and driver private data in certain circumstances. 4) Sanitize netlink message length properly in decnet, from Mateusz Jurczyk. 5) Don't leak kernel data in rtnl_fill_vfinfo() netlink blobs. From Yuval Mintz. 6) Hash secret is never initialized in ipv6 ILA translation code, from Arnd Bergmann. I guess those clang warnings about unused inline functions are useful for something! 7) Fix endian selection in bpf_endian.h, from Daniel Borkmann. 8) Sanitize sockaddr length before dereferncing any fields in AF_UNIX and CAIF. From Mateusz Jurczyk. 9) Fix timestamping for GMAC3 chips in stmmac driver, from Mario Molitor. 10) Do not leak netdev on dev_alloc_name() errors in mac80211, from Johannes Berg. 11) Fix locking in sctp_for_each_endpoint(), from Xin Long. 12) Fix wrong memset size on 32-bit in snmp6, from Christian Perle. 13) Fix use after free in ip_mc_clear_src(), from WANG Cong. 14) Fix regressions caused by ICMP rate limiting changes in 4.11, from Jesper Dangaard Brouer. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (91 commits) i40e: Fix a sleep-in-atomic bug net: don't global ICMP rate limit packets originating from loopback net/act_pedit: fix an error code net: update undefined ->ndo_change_mtu() comment net_sched: move tcf_lock down after gen_replace_estimator() caif: Add sockaddr length check before accessing sa_family in connect handler qed: fix dump of context data qmi_wwan: new Telewell and Sierra device IDs net: phy: Fix MDIO_THUNDER dependencies netconsole: Remove duplicate "netconsole: " logging prefix igmp: acquire pmc lock for ip_mc_clear_src() r8152: give the device version net: rps: fix uninitialized symbol warning mac80211: don't send SMPS action frame in AP mode when not needed mac80211/wpa: use constant time memory comparison for MACs mac80211: set bss_info data before configuring the channel mac80211: remove 5/10 MHz rate code from station MLME mac80211: Fix incorrect condition when checking rx timestamp mac80211: don't look at the PM bit of BAR frames i40e: fix handling of HW ATR eviction ...
| * | | | | | net: update undefined ->ndo_change_mtu() commentMagnus Damm2017-06-141-2/+1
| | |_|_|_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Update ->ndo_change_mtu() callback comment to remove text about returning error in case of undefined callback. This change makes the comment match the existing code behavior. Signed-off-by: Magnus Damm <damm+renesas@opensource.se> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | ethtool.h: remind to update 802.3ad when adding new speedsNicolas Dichtel2017-06-081-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Each time a new speed is added, the bonding 802.3ad isn't updated. Add a comment to remind the developer to update this driver. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Andy Gospodarek <andy@greyhouse.net> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | openvswitch: warn about missing first netlink attributeNicolas Dichtel2017-06-081-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The first netlink attribute (value 0) must always be defined as none/unspec. Because we cannot change an existing UAPI, I add a comment to point the mistake and avoid to propagate it in a new ovs API in the future. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | net: ipv6: Release route when device is unregisteringDavid Ahern2017-06-081-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Roopa reported attempts to delete a bond device that is referenced in a multipath route is hanging: $ ifdown bond2 # ifupdown2 command that deletes virtual devices unregister_netdevice: waiting for bond2 to become free. Usage count = 2 Steps to reproduce: echo 1 > /proc/sys/net/ipv6/conf/all/ignore_routes_with_linkdown ip link add dev bond12 type bond ip link add dev bond13 type bond ip addr add 2001:db8:2::0/64 dev bond12 ip addr add 2001:db8:3::0/64 dev bond13 ip route add 2001:db8:33::0/64 nexthop via 2001:db8:2::2 nexthop via 2001:db8:3::2 ip link del dev bond12 ip link del dev bond13 The root cause is the recent change to keep routes on a linkdown. Update the check to detect when the device is unregistering and release the route for that case. Fixes: a1a22c12060e4 ("net: ipv6: Keep nexthop of multipath route on admin down") Reported-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | net: Fix inconsistent teardown and release of private netdev state.David S. Miller2017-06-071-3/+4
| | |_|/ / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Network devices can allocate reasources and private memory using netdev_ops->ndo_init(). However, the release of these resources can occur in one of two different places. Either netdev_ops->ndo_uninit() or netdev->destructor(). The decision of which operation frees the resources depends upon whether it is necessary for all netdev refs to be released before it is safe to perform the freeing. netdev_ops->ndo_uninit() presumably can occur right after the NETDEV_UNREGISTER notifier completes and the unicast and multicast address lists are flushed. netdev->destructor(), on the other hand, does not run until the netdev references all go away. Further complicating the situation is that netdev->destructor() almost universally does also a free_netdev(). This creates a problem for the logic in register_netdevice(). Because all callers of register_netdevice() manage the freeing of the netdev, and invoke free_netdev(dev) if register_netdevice() fails. If netdev_ops->ndo_init() succeeds, but something else fails inside of register_netdevice(), it does call ndo_ops->ndo_uninit(). But it is not able to invoke netdev->destructor(). This is because netdev->destructor() will do a free_netdev() and then the caller of register_netdevice() will do the same. However, this means that the resources that would normally be released by netdev->destructor() will not be. Over the years drivers have added local hacks to deal with this, by invoking their destructor parts by hand when register_netdevice() fails. Many drivers do not try to deal with this, and instead we have leaks. Let's close this hole by formalizing the distinction between what private things need to be freed up by netdev->destructor() and whether the driver needs unregister_netdevice() to perform the free_netdev(). netdev->priv_destructor() performs all actions to free up the private resources that used to be freed by netdev->destructor(), except for free_netdev(). netdev->needs_free_netdev is a boolean that indicates whether free_netdev() should be done at the end of unregister_netdevice(). Now, register_netdevice() can sanely release all resources after ndo_ops->ndo_init() succeeds, by invoking both ndo_ops->ndo_uninit() and netdev->priv_destructor(). And at the end of unregister_netdevice(), we invoke netdev->priv_destructor() and optionally call free_netdev(). Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | Merge tag 'acpi-4.12-rc6' of ↵Linus Torvalds2017-06-151-0/+14
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI fixes from Rafael Wysocki: "These revert an ACPICA commit from the 4.11 cycle that causes problems to happen on some systems and add a protection against possible kernel crashes due to table reference counter imbalance. Specifics: - Revert a 4.11 ACPICA change that made assumptions which are not satisfied on some systems and caused the enumeration of resources to fail on them (Rafael Wysocki). - Add a mechanism to prevent tables from being unmapped prematurely due to reference counter overflows (Lv Zheng)" * tag 'acpi-4.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: ACPICA: Tables: Mechanism to handle late stage acpi_get_table() imbalance Revert "ACPICA: Disassembler: Enhance resource descriptor detection"
| * \ \ \ \ Merge branch 'acpica-fixes'Rafael J. Wysocki2017-06-151-0/+14
| |\ \ \ \ \ | | |_|_|/ / | |/| | | | | | | | | | | | | | | | | | | | | | * acpica-fixes: ACPICA: Tables: Mechanism to handle late stage acpi_get_table() imbalance Revert "ACPICA: Disassembler: Enhance resource descriptor detection"
| | * | | | ACPICA: Tables: Mechanism to handle late stage acpi_get_table() imbalanceLv Zheng2017-06-121-0/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Considering this case: 1. A program opens a sysfs table file 65535 times, it can increase validation_count and first increment cause the table to be mapped: validation_count = 65535 2. AML execution causes "Load" to be executed on the same table, this time it cannot increase validation_count, so validation_count remains: validation_count = 65535 3. The program closes sysfs table file 65535 times, it can decrease validation_count and the last decrement cause the table to be unmapped: validation_count = 0 4. AML code still accessing the loaded table, kernel crash can be observed. To prevent that from happening, add a validation_count threashold. When it is reached, the validation_count can no longer be incremented/decremented to invalidate the table descriptor (means preventing table unmappings) Note that code added in acpi_tb_put_table() is actually a no-op but changes the warning message into a "warn once" one. Lv Zheng. Signed-off-by: Lv Zheng <lv.zheng@intel.com> [ rjw: Changelog, comments ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* | | | | | Merge tag 'media/v4.12-3' of ↵Linus Torvalds2017-06-152-1/+11
|\ \ \ \ \ \ | |/ / / / / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media Pull media fixes from Mauro Carvalho Chehab: - some build dependency issues at CEC core with randconfigs - fix an off by one error at vb2 - a race fix at cec core - driver fixes at tc358743, sir_ir and rainshadow-cec * tag 'media/v4.12-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: [media] media/cec.h: use IS_REACHABLE instead of IS_ENABLED [media] cec: race fix: don't return -ENONET in cec_receive() [media] sir_ir: infinite loop in interrupt handler [media] cec-notifier.h: handle unreachable CONFIG_CEC_CORE [media] cec: improve MEDIA_CEC_RC dependencies [media] vb2: Fix an off by one error in 'vb2_plane_vaddr' [media] rainshadow-cec: Fix missing spin_lock_init() [media] tc358743: fix register i2c_rd/wr function fix
| * | | | | [media] media/cec.h: use IS_REACHABLE instead of IS_ENABLEDHans Verkuil2017-06-081-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix messages like this: adv7842.c:(.text+0x2edadd): undefined reference to `cec_unregister_adapter' when CEC_CORE=m but the driver including media/cec.h is built-in. In that case the static inlines provided in media/cec.h should be used by that driver. Reported-by: Randy Dunlap <rdunlap@infradead.org> Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Hans Verkuil <hans.verkuil@cisco.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
| * | | | | [media] cec-notifier.h: handle unreachable CONFIG_CEC_COREArnd Bergmann2017-06-061-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix a link error in this specific combination of config options: CONFIG_MEDIA_CEC_SUPPORT=y CONFIG_CEC_CORE=m CONFIG_MEDIA_CEC_NOTIFIER=y CONFIG_VIDEO_STI_HDMI_CEC=m CONFIG_DRM_STI=y drivers/gpu/drm/sti/sti_hdmi.o: In function `sti_hdmi_remove': sti_hdmi.c:(.text.sti_hdmi_remove+0x10): undefined reference to `cec_notifier_set_phys_addr' sti_hdmi.c:(.text.sti_hdmi_remove+0x34): undefined reference to `cec_notifier_put' drivers/gpu/drm/sti/sti_hdmi.o: In function `sti_hdmi_connector_get_modes': sti_hdmi.c:(.text.sti_hdmi_connector_get_modes+0x4a): undefined reference to `cec_notifier_set_phys_addr_from_edid' drivers/gpu/drm/sti/sti_hdmi.o: In function `sti_hdmi_probe': sti_hdmi.c:(.text.sti_hdmi_probe+0x204): undefined reference to `cec_notifier_get' drivers/gpu/drm/sti/sti_hdmi.o: In function `sti_hdmi_connector_detect': sti_hdmi.c:(.text.sti_hdmi_connector_detect+0x36): undefined reference to `cec_notifier_set_phys_addr' drivers/gpu/drm/sti/sti_hdmi.o: In function `sti_hdmi_disable': sti_hdmi.c:(.text.sti_hdmi_disable+0xc0): undefined reference to `cec_notifier_set_phys_addr' The version below seems to work, though I don't particularly like the IS_REACHABLE() addition since that can be confusing to users. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Hans Verkuil <hans.verkuil@cisco.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
* | | | | | Merge branch 'for-linus' of ↵Linus Torvalds2017-06-122-3/+2
|\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull key subsystem fixes from James Morris: "Here are a bunch of fixes for Linux keyrings, including: - Fix up the refcount handling now that key structs use the refcount_t type and the refcount_t ops don't allow a 0->1 transition. - Fix a potential NULL deref after error in x509_cert_parse(). - Don't put data for the crypto algorithms to use on the stack. - Fix the handling of a null payload being passed to add_key(). - Fix incorrect cleanup an uninitialised key_preparsed_payload in key_update(). - Explicit sanitisation of potentially secure data before freeing. - Fixes for the Diffie-Helman code" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (23 commits) KEYS: fix refcount_inc() on zero KEYS: Convert KEYCTL_DH_COMPUTE to use the crypto KPP API crypto : asymmetric_keys : verify_pefile:zero memory content before freeing KEYS: DH: add __user annotations to keyctl_kdf_params KEYS: DH: ensure the KDF counter is properly aligned KEYS: DH: don't feed uninitialized "otherinfo" into KDF KEYS: DH: forbid using digest_null as the KDF hash KEYS: sanitize key structs before freeing KEYS: trusted: sanitize all key material KEYS: encrypted: sanitize all key material KEYS: user_defined: sanitize key payloads KEYS: sanitize add_key() and keyctl() key payloads KEYS: fix freeing uninitialized memory in key_update() KEYS: fix dereferencing NULL payload with nonzero length KEYS: encrypted: use constant-time HMAC comparison KEYS: encrypted: fix race causing incorrect HMAC calculations KEYS: encrypted: fix buffer overread in valid_master_desc() KEYS: encrypted: avoid encrypting/decrypting stack buffers KEYS: put keyring if install_session_keyring_to_cred() fails KEYS: Delete an error message for a failed memory allocation in get_derived_key() ...
| * | | | | | KEYS: DH: add __user annotations to keyctl_kdf_paramsEric Biggers2017-06-091-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Stephan Mueller <smueller@chronox.de> Signed-off-by: James Morris <james.l.morris@oracle.com>
| * | | | | | KEYS: sanitize key structs before freeingEric Biggers2017-06-091-1/+0
| | |_|/ / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While a 'struct key' itself normally does not contain sensitive information, Documentation/security/keys.txt actually encourages this: "Having a payload is not required; and the payload can, in fact, just be a value stored in the struct key itself." In case someone has taken this advice, or will take this advice in the future, zero the key structure before freeing it. We might as well, and as a bonus this could make it a bit more difficult for an adversary to determine which keys have recently been in use. This is safe because the key_jar cache does not use a constructor. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: James Morris <james.l.morris@oracle.com>