linux - linux

	Commit message (Collapse)	Author	Age	Files	Lines
*	vfio: Fix bug in vfio_device_get_from_name()	Joerg Roedel	2015-11-04	1	-3/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The vfio_device_get_from_name() function might return a non-NULL pointer, when called with a device name that is not found in the list. This causes undefined behavior, in my case calling an invalid function pointer later on: kernel tried to execute NX-protected page - exploit attempt? (uid: 0) BUG: unable to handle kernel paging request at ffff8800cb3ddc08 [...] Call Trace: [<ffffffffa03bd733>] ? vfio_group_fops_unl_ioctl+0x253/0x410 [vfio] [<ffffffff811efc4d>] do_vfs_ioctl+0x2cd/0x4c0 [<ffffffff811f9657>] ? __fget+0x77/0xb0 [<ffffffff811efeb9>] SyS_ioctl+0x79/0x90 [<ffffffff81001bb0>] ? syscall_return_slowpath+0x50/0x130 [<ffffffff8167f776>] entry_SYSCALL_64_fastpath+0x16/0x75 Fix the issue by returning NULL when there is no device with the requested name in the list. Cc: stable@vger.kernel.org # v4.2+ Fixes: 4bc94d5dc95d ("vfio: Fix lockdep issue") Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
*	VFIO: platform: reset: AMD xgbe reset module	Eric Auger	2015-11-03	3	-0/+137
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch introduces a module that registers and implements a low-level reset function for the AMD XGBE device. it performs the following actions: - reset the PHY - disable auto-negotiation - disable & clear auto-negotiation IRQ - soft-reset the MAC Those tiny pieces of code are inherited from the native xgbe driver. Signed-off-by: Eric Auger <eric.auger@linaro.org> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
*	vfio: platform: reset: calxedaxgmac: fix ioaddr leak	Eric Auger	2015-11-03	1	-7/+7
\| \| \| \| \| \| \| \| \| \|	In the current code the vfio_platform_region is copied on the stack. As a consequence the ioaddr address is not iounmapped in the vfio platform driver (vfio_platform_regions_cleanup). The patch uses the pointer to the region instead. Signed-off-by: Eric Auger <eric.auger@linaro.org> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
*	vfio: platform: add dev_info on device reset	Eric Auger	2015-11-03	2	-2/+13
\| \| \| \| \| \| \| \| \| \| \|	It might be helpful for the end-user to check the device reset function was found by the vfio platform reset framework. Lets store a pointer to the struct device in vfio_platform_device and trace when the reset function is called or not found. Signed-off-by: Eric Auger <eric.auger@linaro.org> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
*	vfio: platform: use list of registered reset function	Eric Auger	2015-11-03	3	-30/+30
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Remove the static lookup table and use the dynamic list of registered reset functions instead. Also load the reset module through its alias. The reset struct module pointer is stored in vfio_platform_device. We also remove the useless struct device pointer parameter in vfio_platform_get_reset. This patch fixes the issue related to the usage of __symbol_get, which besides from being moot, prevented compilation with CONFIG_MODULES disabled. Also usage of MODULE_ALIAS makes possible to add a new reset module without needing to update the framework. This was suggested by Arnd. Signed-off-by: Eric Auger <eric.auger@linaro.org> Reported-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
*	vfio: platform: add compat in vfio_platform_device	Eric Auger	2015-11-03	2	-7/+9
\| \| \| \| \| \| \| \|	Let's retrieve the compatibility string on probe and store it in the vfio_platform_device struct Signed-off-by: Eric Auger <eric.auger@linaro.org> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
*	vfio: platform: reset: calxedaxgmac: add reset function registration	Eric Auger	2015-11-03	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \|	This patch adds the reset function registration/unregistration. This is handled through the module_vfio_reset_handler macro. This latter also defines a MODULE_ALIAS which simplifies the load from vfio-platform. Signed-off-by: Eric Auger <eric.auger@linaro.org> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
*	vfio: platform: introduce module_vfio_reset_handler macro	Eric Auger	2015-11-03	1	-0/+14
\| \| \| \| \| \| \| \| \| \| \|	The module_vfio_reset_handler macro - define a module alias - implement module init/exit function which respectively registers and unregisters the reset function. Signed-off-by: Eric Auger <eric.auger@linaro.org> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
*	vfio: platform: add capability to register a reset function	Eric Auger	2015-11-03	2	-0/+47
\| \| \| \| \| \| \| \| \| \| \| \|	In preparation for subsequent changes in reset function lookup, lets introduce a dynamic list of reset combos (compat string, reset module, reset function). The list can be populated/voided with vfio_platform_register/unregister_reset. Those are not yet used in this patch. Signed-off-by: Eric Auger <eric.auger@linaro.org> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
*	vfio: platform: introduce vfio-platform-base module	Eric Auger	2015-11-03	5	-4/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	To prepare for vfio platform reset rework let's build vfio_platform_common.c and vfio_platform_irq.c in a separate module from vfio-platform and vfio-amba. This makes possible to have separate module inits and works around a race between platform driver init and vfio reset module init: that way we make sure symbols exported by base are available when vfio-platform driver gets probed. The open/release being implemented in the base module, the ref count is applied to the parent module instead. Signed-off-by: Eric Auger <eric.auger@linaro.org> Suggested-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
*	vfio/platform: store mapped memory in region, instead of an on-stack copy	James Morse	2015-11-03	1	-18/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	vfio_platform_{read,write}_mmio() call ioremap_nocache() to map a region of io memory, which they store in struct vfio_platform_region to be eventually re-used, or unmapped by vfio_platform_regions_cleanup(). These functions receive a copy of their struct vfio_platform_region argument on the stack - so these mapped areas are always allocated, and always leaked. Pass this argument as a pointer instead. Fixes: 6e3f26456009 "vfio/platform: read and write support for the device fd" Signed-off-by: James Morse <james.morse@arm.com> Acked-by: Baptiste Reynal <b.reynal@virtualopensystems.com> Tested-by: Baptiste Reynal <b.reynal@virtualopensystems.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
*	vfio/type1: handle case where IOMMU does not support PAGE_SIZE size	Eric Auger	2015-11-03	1	-1/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Current vfio_pgsize_bitmap code hides the supported IOMMU page sizes smaller than PAGE_SIZE. As a result, in case the IOMMU does not support PAGE_SIZE page, the alignment check on map/unmap is done with larger page sizes, if any. This can fail although mapping could be done with pages smaller than PAGE_SIZE. This patch modifies vfio_pgsize_bitmap implementation so that, in case the IOMMU supports page sizes smaller than PAGE_SIZE we pretend PAGE_SIZE is supported and hide sub-PAGE_SIZE sizes. That way the user will be able to map/unmap buffers whose size/ start address is aligned with PAGE_SIZE. Pinning code uses that granularity while iommu driver can use the sub-PAGE_SIZE size to map the buffer. Signed-off-by: Eric Auger <eric.auger@linaro.org> Acked-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
*	VFIO: platform: clear IRQ_NOAUTOEN when de-assigning the IRQ	Eric Auger	2015-10-27	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \|	The vfio platform driver currently sets the IRQ_NOAUTOEN before doing the request_irq to properly handle the user masking. However it does not clear it when de-assigning the IRQ. This brings issues when loading the native driver again which may not explicitly enable the IRQ. This problem was observed with xgbe driver. Signed-off-by: Eric Auger <eric.auger@linaro.org> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
*	vfio/pci: Use kernel VPD access functions	Alex Williamson	2015-10-27	1	-1/+69
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The PCI VPD capability operates on a set of window registers in PCI config space. Writing to the address register triggers either a read or write, depending on the setting of the PCI_VPD_ADDR_F bit within the address register. The data register provides either the source for writes or the target for reads. This model is susceptible to being broken by concurrent access, for which the kernel has adopted a set of access functions to serialize these registers. Additionally, commits like 932c435caba8 ("PCI: Add dev_flags bit to access VPD through function 0") and 7aa6ca4d39ed ("PCI: Add VPD function 0 quirk for Intel Ethernet devices") indicate that VPD registers can be shared between functions on multifunction devices creating dependencies between otherwise independent devices. Fortunately it's quite easy to emulate the VPD registers, simply storing copies of the address and data registers in memory and triggering a VPD read or write on writes to the address register. This allows vfio users to avoid seeing spurious register changes from accesses on other devices and enables the use of shared quirks in the host kernel. We can theoretically still race with access through sysfs, but the window of opportunity is much smaller. Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Acked-by: Mark Rustad <mark.d.rustad@intel.com>
*	vfio: Whitelist PCI bridges	Alex Williamson	2015-10-27	1	-6/+25
\| \| \| \| \| \| \|	When determining whether a group is viable, we already allow devices bound to pcieport. Generalize this to include any PCI bridge device. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
*	Linux 4.3-rc7v4.3-rc7	Linus Torvalds	2015-10-25	1	-1/+1
\|
*	Merge tag 'usb-4.3-rc7' of ↵	Linus Torvalds	2015-10-24	2	-5/+26
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb Pull USB fixes from Greg KH: "Here are three xhci driver fixes for reported issues for 4.3-rc7 All have been in linux-next for a while with no problems" * tag 'usb-4.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: xhci: Add spurious wakeup quirk for LynxPoint-LP controllers xhci: handle no ping response error properly xhci: don't finish a TD if we get a short transfer event mid TD
\| *	xhci: Add spurious wakeup quirk for LynxPoint-LP controllers	Laura Abbott	2015-10-17	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We received several reports of systems rebooting and powering on after an attempted shutdown. Testing showed that setting XHCI_SPURIOUS_WAKEUP quirk in addition to the XHCI_SPURIOUS_REBOOT quirk allowed the system to shutdown as expected for LynxPoint-LP xHCI controllers. Set the quirk back. Note that the quirk was originally introduced for LynxPoint and LynxPoint-LP just for this same reason. See: commit 638298dc66ea ("xhci: Fix spurious wakeups after S5 on Haswell") It was later limited to only concern HP machines as it caused regression on some machines, see both bug and commit: Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=66171 commit 6962d914f317 ("xhci: Limit the spurious wakeup fix only to HP machines") Later it was discovered that the powering on after shutdown was limited to LynxPoint-LP (Haswell-ULT) and that some non-LP HP machine suffered from spontaneous resume from S3 (which should not be related to the SPURIOUS_WAKEUP quirk at all). An attempt to fix this then removed the SPURIOUS_WAKEUP flag usage completely. commit b45abacde3d5 ("xhci: no switching back on non-ULT Haswell") Current understanding is that LynxPoint-LP (Haswell ULT) machines need the SPURIOUS_WAKEUP quirk, otherwise they will restart, and plain Lynxpoint (Haswell) machines may _not_ have the quirk set otherwise they again will restart. Signed-off-by: Laura Abbott <labbott@fedoraproject.org> Cc: Takashi Iwai <tiwai@suse.de> Cc: Oliver Neukum <oneukum@suse.com> [Added more history to commit message -Mathias] Cc: stable <stable@vger.kernel.org> Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
\| *	xhci: handle no ping response error properly	Mathias Nyman	2015-10-17	1	-5/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If a host fails to wake up a isochronous SuperSpeed device from U1/U2 in time for a isoch transfer it will generate a "No ping response error" Host will then move to the next transfer descriptor. Handle this case in the same way as missed service errors, tag the current TD as skipped and handle it on the next transfer event. Cc: stable <stable@vger.kernel.org> Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
\| *	xhci: don't finish a TD if we get a short transfer event mid TD	Mathias Nyman	2015-10-17	1	-0/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If the difference is big enough between the bytes asked and received in a bulk transfer we can get a short transfer event pointing to a TRB in the middle of the TD. We don't want to handle the TD yet as we will anyway receive a new event for the last TRB in the TD. Hold off from finishing the TD and removing it from the list until we receive an event for the last TRB in the TD Cc: stable <stable@vger.kernel.org> Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* \|	Merge tag 'tty-4.3-rc7' of ↵	Linus Torvalds	2015-10-24	2	-4/+1
\|\ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty Pull tty/serial fixes from Greg KH: "Here are two fixes that resolve reported issues, one with the 8250 driver, and the other with the generic fbcon driver. Both have been in linux-next for a while" * tag 'tty-4.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: fbcon: initialize blink interval before calling fb_set_par Revert "serial: 8250_dma: don't bother DMA with small transfers"
\| * \|	fbcon: initialize blink interval before calling fb_set_par	Scot Doyle	2015-10-18	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Since commit 27a4c827c34ac4256a190cc9d24607f953c1c459 fbcon: use the cursor blink interval provided by vt a PPC64LE kernel fails to boot when fbcon_add_cursor_timer uses an uninitialized ops->cur_blink_jiffies. Prevent by initializing in fbcon_init before the call to info->fbops->fb_set_par. Reported-and-tested-by: Alistair Popple <alistair@popple.id.au> Signed-off-by: Scot Doyle <lkml14@scotdoyle.com> Cc: <stable@vger.kernel.org> [v4.2] Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
\| * \|	Revert "serial: 8250_dma: don't bother DMA with small transfers"	Frederic Danis	2015-10-18	1	-4/+0
\| \|/ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This reverts commit 9119fba0cfeda6d415c9f068df66838a104b87cb. This commit prevents from sending "big" file using Bluetooth. When sending a lot of data quickly through the Bluetooth interface, and after a variable amount of data sent, transfer fails with error: kernel: [ 415.247453] Bluetooth: hci0 hardware error 0x00 Found on T100TA. After reverting this commit, send works fine for any file size. Signed-off-by: Frederic Danis <frederic.danis@linux.intel.com> Fixes: 9119fba0cfed (serial: 8250_dma: don't bother DMA with small transfers) Cc: stable@vger.kernel.org Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Acked-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* \|	Merge tag 'staging-4.3-rc7' of ↵	Linus Torvalds	2015-10-24	4	-11/+40
\|\ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging Pull staging driver fixes from Greg KH: "Here are four iio driver fixes for 4.3-rc7, fixing some reported issues. All of these have been in linux-next for a while" * tag 'staging-4.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: iio: mxs-lradc: Fix temperature offset iio: accel: sca3000: memory corruption in sca3000_read_first_n_hw_rb() iio: st_accel: fix interrupt handling on LIS3LV02 iio: adc: twl4030: Fix ADC[3:6] readings
\| * \	Merge tag 'iio-fixes-for-4.3a' of ↵	Greg Kroah-Hartman	2015-10-17	4	-11/+40
\| \|\ \ \| \| \|/ \| \|/\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	git://git.kernel.org/pub/scm/linux/kernel/git/jic23/iio into staging-linus Jonathan writes: First set of IIO fixes for the 4.3 cycle. * twl4030 - incorrect readings for some channels due to a failure to initialize a bias regulator or configure the lines for input rather than USB use. * lis3lv02 - a missunderstanding of the way the interrupts worked on this chip lead to activation of the wrong interrupt. * sca3000 - an old bug meant that memory corruption could occur in the hardware ring buffer readout function. * mxs-lradc - wrong temp offset.
\| \| *	iio: mxs-lradc: Fix temperature offset	Alexandre Belloni	2015-10-11	1	-4/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	0° Kelvin is actually −273.15°C, not -272.15°C. Fix the temperature offset. Also improve the comment explaining the calculation. Reported-by: Janusz Użycki <j.uzycki@elpromaelectronics.com> Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com> Acked-by: Stefan Wahren <stefan.wahren@i2se.com> Acked-by: Marek Vasut <marex@denx.de> Cc: <stable@vger.kernel.org> Signed-off-by: Jonathan Cameron <jic23@kernel.org>
\| \| *	iio: accel: sca3000: memory corruption in sca3000_read_first_n_hw_rb()	Dan Carpenter	2015-10-03	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	"num_read" is in byte units but we are write u16s so we end up write twice as much as intended. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: <Stable@vger.kernel.org> Signed-off-by: Jonathan Cameron <jic23@kernel.org>
\| \| *	iio: st_accel: fix interrupt handling on LIS3LV02	Linus Walleij	2015-10-03	1	-6/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This accelerometer accidentally either emits a DRDY signal or an IRQ signal. Accidentally I activated the IRQ signal as I thought it was analogous to the interrupt generator on other ST accelerometers. This was wrong. After this patch generic_buffer gives a nice stream of accelerometer readings. Fixes: 3acddf74f807778f "iio: st-sensors: add support for lis3lv02d accelerometer" Cc: Denis CIOCCA <denis.ciocca@st.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Cc: <Stable@vger.kernel.org> Signed-off-by: Jonathan Cameron <jic23@kernel.org>
\| \| *	iio: adc: twl4030: Fix ADC[3:6] readings	Adam YH Lee	2015-10-03	1	-0/+34
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	MADC[3:6] reads incorrect values without these two following changes: - enable the 3v1 bias regulator for ADC[3:6] - configure ADC[3:6] lines as input, not as USB Signed-off-by: Adam YH Lee <adam.yh.lee@gmail.com> Signed-off-by: Jonathan Cameron <jic23@kernel.org>
* \| \|	Merge tag 'for-linus' of ↵	Linus Torvalds	2015-10-24	5	-14/+46
\|\ \ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma Pull infiniband fixes from Doug Ledford: "It's late in the game, I know, but these fixes seemed important enough to warrant a late pull request. They all involve oopses or use after frees or corruptions. Six serious fixes: - Hold the mutex around the find and corresponding update of our gid - The ifa list is rcu protected, copy its contents under rcu to avoid using a freed structure - On error, netdev might be null, so check it before trying to release it - On init, if workqueue alloc fails, fail init - The new demux patches exposed a bug in mlx5 and ipath drivers, we need to use the payload P_Key to determine the P_Key the packet arrived on because the hardware doesn't tell us the truth - Due to a couple convoluted error flows, it is possible for the CM to trigger a use_after_free and a double_free of rb nodes. Add two checks to prevent that. This code has worked for 10+ years. It is likely that some of the recent changes have caused this issue to surface. The current patch will protect us from nasty events for now while we track down why this is just now showing up" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: IB/cm: Fix rb-tree duplicate free and use-after-free IB/cma: Use inner P_Key to determine netdev IB/ucma: check workqueue allocation before usage IB/cma: Potential NULL dereference in cma_id_from_event IB/core: Fix use after free of ifa IB/core: Fix memory corruption in ib_cache_gid_set_default_gid
\| * \| \|	IB/cm: Fix rb-tree duplicate free and use-after-free	Doron Tsur	2015-10-21	1	-1/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	ib_send_cm_sidr_rep could sometimes erase the node from the sidr (depending on errors in the process). Since ib_send_cm_sidr_rep is called both from cm_sidr_req_handler and cm_destroy_id, cm_id_priv could be either erased from the rb_tree twice or not erased at all. Fixing that by making sure it's erased only once before freeing cm_id_priv. Fixes: a977049dacde ('[PATCH] IB: Add the kernel CM implementation') Signed-off-by: Doron Tsur <doront@mellanox.com> Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
\| * \| \|	IB/cma: Use inner P_Key to determine netdev	Haggai Eran	2015-10-20	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When discussing the patches to demux ids in rdma_cm instead of ib_cm, it was decided that it is best to use the P_Key value in the packet headers. However, the mlx5 and ipath drivers are currently unable to send correct P_Key values in GMP headers. They always send using a single P_Key that is set during the GSI QP initialization. Change the rdma_cm code to look at the P_Key value that is part of the packet payload as a workaround. Once the drivers are fixed this patch can be reverted. Fixes: 4c21b5bcef73 ("IB/cma: Add net_dev and private data checks to RDMA CM") Signed-off-by: Haggai Eran <haggaie@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
\| * \| \|	IB/ucma: check workqueue allocation before usage	Sasha Levin	2015-10-20	1	-1/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Allocating a workqueue might fail, which wasn't checked so far and would lead to NULL ptr derefs when an attempt to use it was made. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
\| * \| \|	IB/cma: Potential NULL dereference in cma_id_from_event	Haggai Eran	2015-10-20	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If the lookup of a listening ID failed for an AF_IB request, the code would try to call dev_put() on a NULL net_dev. Fixes: be688195bd08 ("IB/cma: Fix net_dev reference leak with failed requests") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Haggai Eran <haggaie@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
\| * \| \|	IB/core: Fix use after free of ifa	Matan Barak	2015-10-20	1	-8/+27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When using ifup/ifdown while executing enum_netdev_ipv4_ips, ifa could become invalid and cause use after free error. Fixing it by protecting with RCU lock. Fixes: 03db3a2d81e6 ('IB/core: Add RoCE GID table management') Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
\| * \| \|	IB/core: Fix memory corruption in ib_cache_gid_set_default_gid	Doron Tsur	2015-10-15	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When ib_cache_gid_set_default_gid is called from several threads, updating the table could make find_gid fail, therefore a negative index will be retruned and an invalid table entry will be used. Locking find_gid as well fixes this problem. Fixes: 03db3a2d81e6 ('IB/core: Add RoCE GID table management') Signed-off-by: Doron Tsur <doront@mellanox.com> Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
* \| \| \|	Merge tag 'dm-4.3-fixes-4' of ↵	Linus Torvalds	2015-10-24	3	-8/+13
\|\ \ \ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: "Three stable fixes (two in btree code used by DM thinp and one to properly store flags in DM cache metadata's superblock)" * tag 'dm-4.3-fixes-4' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm cache: the CLEAN_SHUTDOWN flag was not being set dm btree: fix leak of bufio-backed block in btree_split_beneath error path dm btree remove: fix a bug when rebalancing nodes after removal
\| * \| \| \|	dm cache: the CLEAN_SHUTDOWN flag was not being set	Joe Thornber	2015-10-23	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If the CLEAN_SHUTDOWN flag is not set when a cache is loaded then all cache blocks are marked as dirty and a full writeback occurs. __commit_transaction() is responsible for setting/clearing CLEAN_SHUTDOWN (based the flags_mutator that is passed in). Fix this issue, of the cache's on-disk flags being wrong, by making sure __commit_transaction() does not reset the flags after the mutator has altered the flags in preparation for them being serialized to disk. before: sb_flags = mutator(le32_to_cpu(disk_super->flags)); disk_super->flags = cpu_to_le32(sb_flags); disk_super->flags = cpu_to_le32(cmd->flags); after: disk_super->flags = cpu_to_le32(cmd->flags); sb_flags = mutator(le32_to_cpu(disk_super->flags)); disk_super->flags = cpu_to_le32(sb_flags); Reported-by: Bogdan Vasiliev <bogdan.vasiliev@gmail.com> Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
\| * \| \| \|	dm btree: fix leak of bufio-backed block in btree_split_beneath error path	Mike Snitzer	2015-10-23	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	btree_split_beneath()'s error path had an outstanding FIXME that speaks directly to the potential for _not_ cleaning up a previously allocated bufio-backed block. Fix this by releasing the previously allocated bufio block using unlock_block(). Reported-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <thornber@redhat.com> Cc: stable@vger.kernel.org
\| * \| \| \|	dm btree remove: fix a bug when rebalancing nodes after removal	Joe Thornber	2015-10-23	1	-6/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit 4c7e309340ff ("dm btree remove: fix bug in redistribute3") wasn't a complete fix for redistribute3(). The redistribute3 function takes 3 btree nodes and shares out the entries evenly between them. If the three nodes in total contained (MAX_ENTRIES * 3) - 1 entries between them then this was erroneously getting rebalanced as (MAX_ENTRIES - 1) on the left and right, and (MAX_ENTRIES + 1) in the center. Fix this issue by being more careful about calculating the target number of entries for the left and right nodes. Unit tested in userspace using this program: https://github.com/jthornber/redistribute3-test/blob/master/redistribute3_t.c Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
* \| \| \| \|	Merge branch 'for-linus' of git://git.kernel.dk/linux-block	Linus Torvalds	2015-10-24	14	-144/+167
\|\ \ \ \ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Pull block layer fixes from Jens Axboe: "A final set of fixes for 4.3. It is (again) bigger than I would have liked, but it's all been through the testing mill and has been carefully reviewed by multiple parties. Each fix is either a regression fix for this cycle, or is marked stable. You can scold me at KS. The pull request contains: - Three simple fixes for NVMe, fixing regressions since 4.3. From Arnd, Christoph, and Keith. - A single xen-blkfront fix from Cathy, fixing a NULL dereference if an error is returned through the staste change callback. - Fixup for some bad/sloppy code in nbd that got introduced earlier in this cycle. From Markus Pargmann. - A blk-mq tagset use-after-free fix from Junichi. - A backing device lifetime fix from Tejun, fixing a crash. - And finally, a set of regression/stable fixes for cgroup writeback from Tejun" * 'for-linus' of git://git.kernel.dk/linux-block: writeback: remove broken rbtree_postorder_for_each_entry_safe() usage in cgwb_bdi_destroy() NVMe: Fix memory leak on retried commands block: don't release bdi while request_queue has live references nvme: use an integer value to Linux errno values blk-mq: fix use-after-free in blk_mq_free_tag_set() nvme: fix 32-bit build warning writeback: fix incorrect calculation of available memory for memcg domains writeback: memcg dirty_throttle_control should be initialized with wb->memcg_completions writeback: bdi_writeback iteration must not skip dying ones writeback: fix bdi_writeback iteration in wakeup_dirtytime_writeback() writeback: laptop_mode_timer_fn() needs rcu_read_lock() around bdi_writeback iteration nbd: Add locking for tasks xen-blkfront: check for null drvdata in blkback_changed (XenbusStateClosing)
\| * \| \| \| \|	writeback: remove broken rbtree_postorder_for_each_entry_safe() usage in ↵	Tejun Heo	2015-10-21	1	-4/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	cgwb_bdi_destroy() a20135ffbc44 ("writeback: don't drain bdi_writeback_congested on bdi destruction") added rbtree_postorder_for_each_entry_safe() which is used to remove all entries; however, according to Cody, the iterator isn't safe against operations which may rebalance the tree. Fix it by switching to repeatedly removing rb_first() until empty. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Cody P Schafer <dev@codyps.com> Fixes: a20135ffbc44 ("writeback: don't drain bdi_writeback_congested on bdi destruction") Link: http://lkml.kernel.org/g/1443997973-1700-1-git-send-email-dev@codyps.com Signed-off-by: Jens Axboe <axboe@fb.com>
\| * \| \| \| \|	NVMe: Fix memory leak on retried commands	Keith Busch	2015-10-15	1	-2/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Resources are reallocated for requeued commands, so unmap and release the iod for the failed command. It's a pretty bad memory leak and causes a kernel hang if you remove a drive because of a busy dma pool. You'll get messages spewing like this: nvme 0000:xx:xx.x: dma_pool_destroy prp list 256, ffff880420dec000 busy and lock up pci and the driver since removal never completes while holding a lock. Cc: stable@vger.kernel.org Cc: <stable@vger.kernel.org> # 4.0.x- Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
\| * \| \| \| \|	block: don't release bdi while request_queue has live references	Tejun Heo	2015-10-15	4	-3/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	bdi's are initialized in two steps, bdi_init() and bdi_register(), but destroyed in a single step by bdi_destroy() which, for a bdi embedded in a request_queue, is called during blk_cleanup_queue() which makes the queue invisible and starts the draining of remaining usages. A request_queue's user can access the congestion state of the embedded bdi as long as it holds a reference to the queue. As such, it may access the congested state of a queue which finished blk_cleanup_queue() but hasn't reached blk_release_queue() yet. Because the congested state was embedded in backing_dev_info which in turn is embedded in request_queue, accessing the congested state after bdi_destroy() was called was fine. The bdi was destroyed but the memory region for the congested state remained accessible till the queue got released. a13f35e87140 ("writeback: don't embed root bdi_writeback_congested in bdi_writeback") changed the situation. Now, the root congested state which is expected to be pinned while request_queue remains accessible is separately reference counted and the base ref is put during bdi_destroy(). This means that the root congested state may go away prematurely while the queue is between bdi_dstroy() and blk_cleanup_queue(), which was detected by Andrey's KASAN tests. The root cause of this problem is that bdi doesn't distinguish the two steps of destruction, unregistration and release, and now the root congested state actually requires a separate release step. To fix the issue, this patch separates out bdi_unregister() and bdi_exit() from bdi_destroy(). bdi_unregister() is called from blk_cleanup_queue() and bdi_exit() from blk_release_queue(). bdi_destroy() is now just a simple wrapper calling the two steps back-to-back. While at it, the prototype of bdi_destroy() is moved right below bdi_setup_and_register() so that the counterpart operations are located together. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: a13f35e87140 ("writeback: don't embed root bdi_writeback_congested in bdi_writeback") Cc: stable@vger.kernel.org # v4.2+ Reported-and-tested-by: Andrey Konovalov <andreyknvl@google.com> Link: http://lkml.kernel.org/g/CAAeHK+zUJ74Zn17=rOyxacHU18SgCfC6bsYW=6kCY5GXJBwGfQ@mail.gmail.com Reviewed-by: Jan Kara <jack@suse.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
\| * \| \| \| \|	nvme: use an integer value to Linux errno values	Christoph Hellwig	2015-10-15	1	-5/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Use a separate integer variable to hold the signed Linux errno values we pass back to the block layer. Note that for pass through commands those might still be NVMe values, but those fit into the int as well. Fixes: f4829a9b7a61: ("blk-mq: fix racy updates of rq->errors") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
\| * \| \| \| \|	blk-mq: fix use-after-free in blk_mq_free_tag_set()	Junichi Nomura	2015-10-15	2	-3/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tags is freed in blk_mq_free_rq_map() and should not be used after that. The problem doesn't manifest if CONFIG_CPUMASK_OFFSTACK is false because free_cpumask_var() is nop. tags->cpumask is allocated in blk_mq_init_tags() so it's natural to free cpumask in its counter part, blk_mq_free_tags(). Fixes: f26cdc8536ad ("blk-mq: Shared tag enhancements") Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: Keith Busch <keith.busch@intel.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
\| * \| \| \| \|	nvme: fix 32-bit build warning	Arnd Bergmann	2015-10-12	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Compiling the nvme driver on 32-bit warns about a cast from a __u64 variable to a pointer: drivers/block/nvme-core.c: In function 'nvme_submit_io': drivers/block/nvme-core.c:1847:4: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast] (void __user *)io.addr, length, NULL, 0); The cast here is intentional and safe, so we can shut up the gcc warning by adding an intermediate cast to 'uintptr_t'. I had previously submitted a patch to fix this problem in the nvme driver, but it was accepted on the same day that two new warnings got added. For clarification, I also change the third instance of this cast to use uintptr_t instead of unsigned long now. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: d29ec8241c10e ("nvme: submit internal commands through the block layer") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
\| * \| \| \| \|	writeback: fix incorrect calculation of available memory for memcg domains	Tejun Heo	2015-10-12	3	-32/+40
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	For memcg domains, the amount of available memory was calculated as min(the amount currently in use + headroom according to memcg, total clean memory) This isn't quite correct as what should be capped by the amount of clean memory is the headroom, not the sum of memory in use and headroom. For example, if a memcg domain has a significant amount of dirty memory, the above can lead to a value which is lower than the current amount in use which doesn't make much sense. In most circumstances, the above leads to a number which is somewhat but not drastically lower. As the amount of memory which can be readily allocated to the memcg domain is capped by the amount of system-wide clean memory which is not already assigned to the memcg itself, the number we want is the amount currently in use + min(headroom according to memcg, clean memory elsewhere in the system) This patch updates mem_cgroup_wb_stats() to return the number of filepages and headroom instead of the calculated available pages. mdtc_cap_avail() is renamed to mdtc_calc_avail() and performs the above calculation from file, headroom, dirty and globally clean pages. v2: Dummy mem_cgroup_wb_stats() implementation wasn't updated leading to build failure when !CGROUP_WRITEBACK. Fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: c2aa723a6093 ("writeback: implement memcg writeback domain based throttling") Signed-off-by: Jens Axboe <axboe@fb.com>
\| * \| \| \| \|	writeback: memcg dirty_throttle_control should be initialized with ↵	Tejun Heo	2015-10-12	1	-9/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	wb->memcg_completions MDTC_INIT() is used to initialize dirty_throttle_control for memcg domains. It used DTC_INIT_COMMON() to initialized mdtc->wb and ->wb_completions which is incorrect as DTC_INIT_COMMON() sets the latter to wb->completions instead of wb->memcg_completions. This can lead to wildly incorrect results when calculating the proportion of dirty memory the memcg domain should get. Remove DTC_INIT_COMMON() and update MDTC_INIT() to initialize mdtc->wb_completions to wb->memcg_completions. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: c2aa723a6093 ("writeback: implement memcg writeback domain based throttling") Signed-off-by: Jens Axboe <axboe@fb.com>
\| * \| \| \| \|	writeback: bdi_writeback iteration must not skip dying ones	Tejun Heo	2015-10-12	5	-75/+39
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	bdi_for_each_wb() is used in several places to wake up or issue writeback work items to all wb's (bdi_writeback's) on a given bdi. The iteration is performed by walking bdi->cgwb_tree; however, the tree only indexes wb's which are currently active. For example, when a memcg gets associated with a different blkcg, the old wb is removed from the tree so that the new one can be indexed. The old wb starts dying from then on but will linger till all its inodes are drained. As these dying wb's may still host dirty inodes, writeback operations which affect all wb's must include them. bdi_for_each_wb() skipping dying wb's led to sync(2) missing and failing to sync the inodes belonging to those wb's. This patch adds a RCU protected @bdi->wb_list which lists all wb's beloinging to that bdi. wb's are added on creation and removed on release rather than on the start of destruction. bdi_for_each_wb() usages are replaced with list_for_each[_continue]_rcu() iterations over @bdi->wb_list and bdi_for_each_wb() and its helpers are removed. v2: Updated as per Jan. last_wb ref leak in bdi_split_work_to_wbs() fixed and unnecessary list head severing in cgwb_bdi_destroy() removed. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Artem Bityutskiy <dedekind1@gmail.com> Fixes: ebe41ab0c79d ("writeback: implement bdi_for_each_wb()") Link: http://lkml.kernel.org/g/1443012552.19983.209.camel@gmail.com Cc: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>