linux - linux

	Commit message (Collapse)	Author	Age	Files	Lines
*	virtio_pci_modern: reduce number of mappings	Michael S. Tsirkin	2015-01-21	2	-8/+52
\| \| \| \| \| \| \| \| \| \| \|	We don't know the # of VQs that drivers are going to use so it's hard to predict how much memory we'll need to map. However, the relevant capability does give us an upper limit. If that's below a page, we can reduce the number of required mappings by mapping it all once ahead of the time. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
*	virtio_pci: macros for PCI layout offsets	Rusty Russell	2015-01-21	2	-6/+90
\| \| \| \| \| \| \|	QEMU wants it, so why not? Trust, but verify. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
*	virtio_pci: modern driver	Michael S. Tsirkin	2015-01-21	5	-10/+625
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Lightly tested against qemu. One thing not implemented here is separate mappings for descriptor/avail/used rings. That's nice to have, will be done later after we have core support. This also exposes the PCI layout to userspace, and adds macros for PCI layout offsets: QEMU wants it, so why not? Trust, but verify. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
*	virtio-pci: define layout for virtio 1.0	Rusty Russell	2015-01-21	1	-0/+62
\| \| \| \| \| \| \| \| \|	Based on patches by Michael S. Tsirkin <mst@redhat.com>, but I found it hard to follow so changed to use structures which are more self-documenting. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
*	virtio_pci: move probe/remove code to common	Michael S. Tsirkin	2015-01-21	3	-72/+77
\| \| \| \| \| \| \| \|	Most of initialization is device-independent. Let's move it to common. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
*	virtio_pci: drop useless del_vqs call	Sasha Levin	2015-01-21	1	-1/+0
\| \| \| \| \| \| \| \| \| \|	Device VQs were getting freed twice: once in every device's removal functions, and then again in virtio_pci_legacy_remove(). The ones in devices are called first, so drop the useless second call. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
*	s390: add pci_iomap_range	Michael S. Tsirkin	2015-01-21	2	-7/+28
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Virtio drivers should map the part of the range they need, not necessarily all of it. To this end, support mapping ranges within BAR on s390. Since multiple ranges can now be mapped within a BAR, we keep track of the number of mappings created, and only clear out the mapping for a BAR when this number reaches 0. Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: linux-pci@vger.kernel.org Tested-by: Sebastian Ott <sebott@linux.vnet.ibm.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
*	pci: add pci_iomap_range	Michael S. Tsirkin	2015-01-21	2	-5/+40
\| \| \| \| \| \| \| \| \| \| \|	Virtio drivers should map the part of the BAR they need, not necessarily all of it. Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: linux-pci@vger.kernel.org Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
*	mn10300: drop dead code	Michael S. Tsirkin	2015-01-21	1	-35/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	pci-iomap.c was (apparently, mistakenly) reintroduced as part of commit 83c2dc15ce824450e7044b9f90cd529c25747ae0 MN10300: Handle cacheable PCI regions in pci_iomap() probably as side-effect of forward-porting the patch from an old kernel. It's not really needed: the generic pci_iomap does the right thing here. The new file isn't compiled so it's safe to drop. Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: linux-pci@vger.kernel.org Cc: trivial@kernel.org Cc: David Howells <dhowells@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
*	virtio/balloon: verify device has config space	Michael S. Tsirkin	2015-01-21	1	-0/+6
\| \| \| \| \| \| \| \| \| \|	Some devices might not implement config space access (e.g. remoteproc used not to - before 3.9). virtio/balloon needs config space access so make it fail gracefully if not there. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
*	virtio/scsi: verify device has config space	Michael S. Tsirkin	2015-01-21	1	-0/+6
\| \| \| \| \| \| \| \| \| \|	Some devices might not implement config space access (e.g. remoteproc used not to - before 3.9). virtio/scsi needs config space access so make it fail gracefully if not there. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
*	virtio/net: verify device has config space	Michael S. Tsirkin	2015-01-21	1	-0/+6
\| \| \| \| \| \| \| \| \| \|	Some devices might not implement config space access (e.g. remoteproc used not to - before 3.9). virtio/net needs config space access so make it fail gracefully if not there. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
*	virtio/console: verify device has config space	Michael S. Tsirkin	2015-01-21	1	-0/+6
\| \| \| \| \| \| \| \| \| \|	Some devices might not implement config space access (e.g. remoteproc used not to - before 3.9). virtio/console needs config space access so make it fail gracefully if not there. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
*	virtio/blk: verify device has config space	Michael S. Tsirkin	2015-01-21	1	-0/+6
\| \| \| \| \| \| \| \| \| \|	Some devices might not implement config space access (e.g. remoteproc used not to - before 3.9). virtio/blk needs config space access so make it fail gracefully if not there. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
*	virtio/9p: verify device has config space	Michael S. Tsirkin	2015-01-21	1	-0/+6
\| \| \| \| \| \| \| \| \| \|	Some devices might not implement config space access (e.g. remoteproc used not to - before 3.9). virtio/9p needs config space access so make it fail gracefully if not there. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
*	virtio_pci: drop virtio_config dependency	Michael S. Tsirkin	2015-01-21	1	-1/+1
\| \| \| \| \| \| \| \|	virtio_pci does not depend on virtio_config: let's not include it, users can pull it in as necessary. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
*	Merge branch 'for-3.19-fixes' of ↵	Linus Torvalds	2015-01-20	12	-64/+118
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata Pull libata fixes from Tejun Heo: - Bartlomiej will be co-maintaining PATA portion of libata. git workflow will stay the same. - sata_sil24 wasn't happy with tag ordered submission. An option to restore the old tag allocation behavior is implemented for sil24. - a very old race condition in PIO host state machine which can trigger BUG fixed. - other driver-specific changes * 'for-3.19-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata: libata: prevent HSM state change race between ISR and PIO libata: allow sata_sil24 to opt-out of tag ordered submission ata: pata_at91: depend on !ARCH_MULTIPLATFORM ahci: Remove Device ID for Intel Sunrise Point PCH ahci: Use dev_info() to inform about the lack of Device Sleep support libata: Whitelist SSDs that are known to properly return zeroes after TRIM sata_dwc_460ex: fix resource leak on error path ata: add MAINTAINERS entry for libata PATA drivers libata: clean up MAINTAINERS entries libata: export ata_get_cmd_descript() ahci_xgene: Fix the DMA state machine lockup for the ATA_CMD_PACKET PIO mode command. ahci_xgene: Fix the endianess issue in APM X-Gene SoC AHCI SATA controller driver.
\| *	libata: prevent HSM state change race between ISR and PIO	David Jeffery	2015-01-19	1	-0/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	It is possible for ata_sff_flush_pio_task() to set ap->hsm_task_state to HSM_ST_IDLE in between the time __ata_sff_port_intr() checks for HSM_ST_IDLE and before it calls ata_sff_hsm_move() causing ata_sff_hsm_move() to BUG(). This problem is hard to reproduce making this patch hard to verify, but this fix will prevent the race. I have not been able to reproduce the problem, but here is a crash dump from a 2.6.32 kernel. On examining the ata port's state, its hsm_task_state field has a value of HSM_ST_IDLE: crash> struct ata_port.hsm_task_state ffff881c1121c000 hsm_task_state = 0 Normally, this should not be possible as ata_sff_hsm_move() was called from ata_sff_host_intr(), which checks hsm_task_state and won't call ata_sff_hsm_move() if it has a HSM_ST_IDLE value. PID: 11053 TASK: ffff8816e846cae0 CPU: 0 COMMAND: "sshd" #0 [ffff88008ba03960] machine_kexec at ffffffff81038f3b #1 [ffff88008ba039c0] crash_kexec at ffffffff810c5d92 #2 [ffff88008ba03a90] oops_end at ffffffff8152b510 #3 [ffff88008ba03ac0] die at ffffffff81010e0b #4 [ffff88008ba03af0] do_trap at ffffffff8152ad74 #5 [ffff88008ba03b50] do_invalid_op at ffffffff8100cf95 #6 [ffff88008ba03bf0] invalid_op at ffffffff8100bf9b [exception RIP: ata_sff_hsm_move+317] RIP: ffffffff813a77ad RSP: ffff88008ba03ca0 RFLAGS: 00010097 RAX: 0000000000000000 RBX: ffff881c1121dc60 RCX: 0000000000000000 RDX: ffff881c1121dd10 RSI: ffff881c1121dc60 RDI: ffff881c1121c000 RBP: ffff88008ba03d00 R8: 0000000000000000 R9: 000000000000002e R10: 000000000001003f R11: 000000000000009b R12: ffff881c1121c000 R13: 0000000000000000 R14: 0000000000000050 R15: ffff881c1121dd78 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #7 [ffff88008ba03d08] ata_sff_host_intr at ffffffff813a7fbd #8 [ffff88008ba03d38] ata_sff_interrupt at ffffffff813a821e #9 [ffff88008ba03d78] handle_IRQ_event at ffffffff810e6ec0 --- <IRQ stack> --- [exception RIP: pipe_poll+48] RIP: ffffffff81192780 RSP: ffff880f26d459b8 RFLAGS: 00000246 RAX: 0000000000000000 RBX: ffff880f26d459c8 RCX: 0000000000000000 RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff881a0539fa80 RBP: ffffffff8100bb8e R8: ffff8803b23324a0 R9: 0000000000000000 R10: ffff880f26d45dd0 R11: 0000000000000008 R12: ffffffff8109b646 R13: ffff880f26d45948 R14: 0000000000000246 R15: 0000000000000246 ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018 RIP: 00007f26017435c3 RSP: 00007fffe020c420 RFLAGS: 00000206 RAX: 0000000000000017 RBX: ffffffff8100b072 RCX: 00007fffe020c45c RDX: 00007f2604a3f120 RSI: 00007f2604a3f140 RDI: 000000000000000d RBP: 0000000000000000 R8: 00007fffe020e570 R9: 0101010101010101 R10: 0000000000000000 R11: 0000000000000246 R12: 00007fffe020e5f0 R13: 00007fffe020e5f4 R14: 00007f26045f373c R15: 00007fffe020e5e0 ORIG_RAX: 0000000000000017 CS: 0033 SS: 002b Somewhere between the ata_sff_hsm_move() check and the ata_sff_host_intr() check, the value changed. On examining the other cpus to see what else was running, another cpu was running the error handler routines: PID: 326 TASK: ffff881c11014aa0 CPU: 1 COMMAND: "scsi_eh_1" #0 [ffff88008ba27e90] crash_nmi_callback at ffffffff8102fee6 #1 [ffff88008ba27ea0] notifier_call_chain at ffffffff8152d515 #2 [ffff88008ba27ee0] atomic_notifier_call_chain at ffffffff8152d57a #3 [ffff88008ba27ef0] notify_die at ffffffff810a154e #4 [ffff88008ba27f20] do_nmi at ffffffff8152b1db #5 [ffff88008ba27f50] nmi at ffffffff8152aaa0 [exception RIP: _spin_lock_irqsave+47] RIP: ffffffff8152a1ff RSP: ffff881c11a73aa0 RFLAGS: 00000006 RAX: 0000000000000001 RBX: ffff881c1121deb8 RCX: 0000000000000000 RDX: 0000000000000246 RSI: 0000000000000020 RDI: ffff881c122612d8 RBP: ffff881c11a73aa0 R8: ffff881c17083800 R9: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffff881c1121c000 R13: 000000000000001f R14: ffff881c1121dd50 R15: ffff881c1121dc60 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000 --- <NMI exception stack> --- #6 [ffff881c11a73aa0] _spin_lock_irqsave at ffffffff8152a1ff #7 [ffff881c11a73aa8] ata_exec_internal_sg at ffffffff81396fb5 #8 [ffff881c11a73b58] ata_exec_internal at ffffffff81397109 #9 [ffff881c11a73bd8] atapi_eh_request_sense at ffffffff813a34eb Before it tried to acquire a spinlock, ata_exec_internal_sg() called ata_sff_flush_pio_task(). This function will set ap->hsm_task_state to HSM_ST_IDLE, and has no locking around setting this value. ata_sff_flush_pio_task() can then race with the interrupt handler and potentially set HSM_ST_IDLE at a fatal moment, which will trigger a kernel BUG. v2: Fixup comment in ata_sff_flush_pio_task() tj: Further updated comment. Use ap->lock instead of shost lock and use the [un]lock_irq variant instead of the irqsave/restore one. Signed-off-by: David Milburn <dmilburn@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org
\| *	libata: allow sata_sil24 to opt-out of tag ordered submission	Dan Williams	2015-01-19	3	-2/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Ronny reports: https://bugzilla.kernel.org/show_bug.cgi?id=87101 "Since commit 8a4aeec8d "libata/ahci: accommodate tag ordered controllers" the access to the harddisk on the first SATA-port is failing on its first access. The access to the harddisk on the second port is working normal. When reverting the above commit, access to both harddisks is working fine again." Maintain tag ordered submission as the default, but allow sata_sil24 to continue with the old behavior. Cc: <stable@vger.kernel.org> Cc: Tejun Heo <tj@kernel.org> Reported-by: Ronny Hegewald <Ronny.Hegewald@online.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Tejun Heo <tj@kernel.org>
\| *	ata: pata_at91: depend on !ARCH_MULTIPLATFORM	Alexandre Belloni	2015-01-13	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Until the driver is corrected to stop using mach/at91isam9_smc.h, it won't compile in a ARCH_MULTIPLATFORM configuration. Suggested-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com> Signed-off-by: Tejun Heo <tj@kernel.org>
\| *	ahci: Remove Device ID for Intel Sunrise Point PCH	James Ralston	2015-01-13	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch removes a duplicate AHCI-mode SATA Device ID for the Intel Sunrise Point PCH. Signed-off-by: James Ralston <james.d.ralston@intel.com> Signed-off-by: Tejun Heo <tj@kernel.org>
\| *	ahci: Use dev_info() to inform about the lack of Device Sleep support	Gabriele Mazzotta	2015-01-09	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	According to the Serial ATA AHCI specification, Device Sleep is an optional feature and as such no errors should be printed if it's missing. Keep informing users, but use dev_info() instead of dev_err(). Signed-off-by: Gabriele Mazzotta <gabriele.mzt@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
\| *	libata: Whitelist SSDs that are known to properly return zeroes after TRIM	Martin K. Petersen	2015-01-08	3	-8/+34
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	As defined, the DRAT (Deterministic Read After Trim) and RZAT (Return Zero After Trim) flags in the ATA Command Set are unreliable in the sense that they only define what happens if the device successfully executed the DSM TRIM command. TRIM is only advisory, however, and the device is free to silently ignore all or parts of the request. In practice this renders the DRAT and RZAT flags completely useless and because the results are unpredictable we decided to disable discard in MD for 3.18 to avoid the risk of data corruption. Hardware vendors in the real world obviously need better guarantees than what the standards bodies provide. Unfortuntely those guarantees are encoded in product requirements documents rather than somewhere we can key off of them programatically. So we are compelled to disabling discard_zeroes_data for all devices unless we explicitly have data to support whitelisting them. This patch whitelists SSDs from a few of the main vendors. None of the whitelists are based on written guarantees. They are purely based on empirical evidence collected from internal and external users that have tested or qualified these drives in RAID deployments. The whitelist is only meant as a starting point and is by no means comprehensive: - All intel SSD models except for 510 - Micron M5?0/M600 - Samsung SSDs - Seagate SSDs Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Tejun Heo <tj@kernel.org>
\| *	sata_dwc_460ex: fix resource leak on error path	Andy Shevchenko	2015-01-07	1	-14/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	DMA mapped IO should be unmapped on the error path in probe() and unconditionally on remove(). Fixes: 62936009f35a ([libata] Add 460EX on-chip SATA driver, sata_dwc_460ex) Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Tejun Heo <tj@kernel.org>
\| *	ata: add MAINTAINERS entry for libata PATA drivers	Bartlomiej Zolnierkiewicz	2015-01-07	1	-0/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add myself as the primary maintainer for libata PATA drivers. The merging process would remain unchanged with patches going through Tejun's tree. Cc: Alan Cox <alan@linux.intel.com> Cc: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Acked-by: Kyungmin Park <kyungmin.park@samsung.com> Signed-off-by: Tejun Heo <tj@kernel.org>
\| *	libata: clean up MAINTAINERS entries	Tejun Heo	2015-01-07	1	-32/+34
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Make all libata entries start with LIBATA and collect them in one place. Driver specfic ones have the second SATA or PATA prefix. Signed-off-by: Tejun Heo <tj@kernel.org>
\| *	libata: export ata_get_cmd_descript()	Andy Shevchenko	2015-01-05	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The driver sata_dwc_460ex is using this symbol. To build it as a module we have to have the symbol exported. This patch adds EXPORT_SYMBOL_GPL() macro for that. tj: Updated to use EXPORT_SYMBOL_GPL() instead of EXPORT_SYMBOL() as the only known user is an in-tree driver. Suggested by Sergei. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
\| *	ahci_xgene: Fix the DMA state machine lockup for the ATA_CMD_PACKET PIO mode ↵	Suman Tripathi	2015-01-05	1	-5/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	command. This patch addresses the issue with ATA_CMD_PACKET pio mode command for enumeration and device detection with ATAPI devices. The X-Gene AHCI controller has an errata in which it cannot clear the BSY bit after the PIO setup FIS. The dma state machine enters CMFatalErrorUpdate state and locks up. Signed-off-by: Suman Tripathi <stripathi@apm.com> Signed-off-by: Tejun Heo <tj@kernel.org>
\| *	ahci_xgene: Fix the endianess issue in APM X-Gene SoC AHCI SATA controller ↵	Suman Tripathi	2015-01-05	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	driver. This patch fixes the big endian mode issue with function xgene_ahci_read_id. Signed-off-by: Suman Tripathi <stripathi@apm.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* \|	Merge branch 'for-3.19-fixes' of ↵	Linus Torvalds	2015-01-20	1	-17/+8
\|\ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fix from Tejun Heo: "The xfs folks have been running into weird and very rare lockups for some time now. I didn't think this could have been from workqueue side because no one else was reporting it. This time, Eric had a kdump which we looked into and it turned out this actually was a workqueue bug and the bug has been there since the beginning of concurrency managed workqueue. A worker pool ensures forward progress of the workqueues associated with it by always having at least one worker reserved from executing work items. When the pool is under contention, the idle one tries to create more workers for the pool and if that doesn't succeed quickly enough, it calls the rescuers to the pool. This logic had a subtle race condition in an early exit path. When a worker invokes this manager function, the function may return %false indicating that the caller may proceed to executing work items either because another worker is already performing the role or conditions have changed and the pool is no longer under contention. The latter part depended on the assumption that whether more workers are necessary or not remains stable while the pool is locked; however, pool->nr_running (concurrency count) may change asynchronously and it getting bumped from zero asynchronously could send off the last idle worker to execute work items. The race window is fairly narrow, and, even when it gets triggered, the pool deadlocks iff if all work items get blocked on pending work items of the pool, which is highly unlikely but can be triggered by xfs. The patch removes the race window by removing the early exit path, which doesn't server any purpose anymore anyway" * 'for-3.19-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: fix subtle pool management issue which can stall whole worker_pool
\| * \|	workqueue: fix subtle pool management issue which can stall whole worker_pool	Tejun Heo	2015-01-16	1	-17/+8
\| \|/ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	A worker_pool's forward progress is guaranteed by the fact that the last idle worker assumes the manager role to create more workers and summon the rescuers if creating workers doesn't succeed in timely manner before proceeding to execute work items. This manager role is implemented in manage_workers(), which indicates whether the worker may proceed to work item execution with its return value. This is necessary because multiple workers may contend for the manager role, and, if there already is a manager, others should proceed to work item execution. Unfortunately, the function also indicates that the worker may proceed to work item execution if need_to_create_worker() is false at the head of the function. need_to_create_worker() tests the following conditions. pending work items && !nr_running && !nr_idle The first and third conditions are protected by pool->lock and thus won't change while holding pool->lock; however, nr_running can change asynchronously as other workers block and resume and while it's likely to be zero, as someone woke this worker up in the first place, some other workers could have become runnable inbetween making it non-zero. If this happens, manage_worker() could return false even with zero nr_idle making the worker, the last idle one, proceed to execute work items. If then all workers of the pool end up blocking on a resource which can only be released by a work item which is pending on that pool, the whole pool can deadlock as there's no one to create more workers or summon the rescuers. This patch fixes the problem by removing the early exit condition from maybe_create_worker() and making manage_workers() return false iff there's already another manager, which ensures that the last worker doesn't start executing work items. We can leave the early exit condition alone and just ignore the return value but the only reason it was put there is because the manage_workers() used to perform both creations and destructions of workers and thus the function may be invoked while the pool is trying to reduce the number of workers. Now that manage_workers() is called only when more workers are needed, the only case this early exit condition is triggered is rare race conditions rendering it pointless. Tested with simulated workload and modified workqueue code which trigger the pool deadlock reliably without this patch. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Eric Sandeen <sandeen@sandeen.net> Link: http://lkml.kernel.org/g/54B019F4.8030009@sandeen.net Cc: Dave Chinner <david@fromorbit.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: stable@vger.kernel.org
* \|	Merge tag 'pinctrl-v3.19-3' of ↵	Linus Torvalds	2015-01-20	5	-31/+26
\|\ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl Pull pin control fixes from Linus Walleij: "Here is a (hopefully final) slew of pin control fixes for the v3.19 series. The deadlock fix is kind of serious and tagged for stable, the rest is business as usual. - Fix two deadlocks around the pin control mutexes, a long-standing issue that manifest itself in plug/unplug of pin controllers. (Tagged for stable.) - Handle an error path with zero functions in the Qualcomm pin controller. - Drop a bogus second GPIO chip added in the Lantiq driver. - Fix sudden IRQ loss on Rockchip pin controllers. - Register the GIT tree in MAINTAINERS" * tag 'pinctrl-v3.19-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: pinctrl: MAINTAINERS: add git tree reference pinctrl: qcom: Don't iterate past end of function array pinctrl: lantiq: remove bogus of_gpio_chip_add pinctrl: Fix two deadlocks pinctrl: rockchip: Avoid losing interrupts when supporting both edges
\| * \|	pinctrl: MAINTAINERS: add git tree reference	Linus Walleij	2015-01-19	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Reference my pinctrl GIT tree @kernel.org Reported-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
\| * \|	pinctrl: qcom: Don't iterate past end of function array	Stephen Boyd	2015-01-19	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Timur reports that this code crashes if nfunctions is 0. Fix the loop iteration to only consider valid elements of the functions array. Reported-by: Timur Tabi <timur@codeaurora.org> Cc: Pramod Gurav <pramod.gurav@smartplayin.com> Cc: Bjorn Andersson <bjorn.andersson@sonymobile.com> Cc: Ivan T. Ivanov <iivanov@mm-sol.com> Cc: Andy Gross <agross@codeaurora.org> Fixes: 327455817a92 "pinctrl: qcom: Add support for reset for apq8064" Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
\| * \|	pinctrl: lantiq: remove bogus of_gpio_chip_add	Johan Hovold	2015-01-14	1	-2/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Remove bogus call to of_gpiochip_add (and of_gpio_chip remove in error path) which is also called when adding the gpio chip. This prevents adding the same pinctrl range twice. Fixes: 3f8c50c9b110 ("OF: pinctrl: MIPS: lantiq: implement lantiq/xway pinctrl support") Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
\| * \|	pinctrl: Fix two deadlocks	Jim Lin	2015-01-14	1	-2/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch is to fix two deadlock cases. Deadlock 1: CPU #1 pinctrl_register-> pinctrl_get -> create_pinctrl (Holding lock pinctrl_maps_mutex) -> get_pinctrl_dev_from_devname (Trying to acquire lock pinctrldev_list_mutex) CPU #0 pinctrl_unregister (Holding lock pinctrldev_list_mutex) -> pinctrl_put ->> pinctrl_free -> pinctrl_dt_free_maps -> pinctrl_unregister_map (Trying to acquire lock pinctrl_maps_mutex) Simply to say CPU#1 is holding lock A and trying to acquire lock B, CPU#0 is holding lock B and trying to acquire lock A. Deadlock 2: CPU #3 pinctrl_register-> pinctrl_get -> create_pinctrl (Holding lock pinctrl_maps_mutex) -> get_pinctrl_dev_from_devname (Trying to acquire lock pinctrldev_list_mutex) CPU #2 pinctrl_unregister (Holding lock pctldev->mutex) -> pinctrl_put ->> pinctrl_free -> pinctrl_dt_free_maps -> pinctrl_unregister_map (Trying to acquire lock pinctrl_maps_mutex) CPU #0 tegra_gpio_request (Holding lock pinctrldev_list_mutex) -> pinctrl_get_device_gpio_range (Trying to acquire lock pctldev->mutex) Simply to say CPU#3 is holding lock A and trying to acquire lock D, CPU#2 is holding lock B and trying to acquire lock A, CPU#0 is holding lock D and trying to acquire lock B. Cc: Stable <stable@vger.kernel.org> Signed-off-by: Jim Lin <jilin@nvidia.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
\| * \|	pinctrl: rockchip: Avoid losing interrupts when supporting both edges	Doug Anderson	2015-01-14	1	-25/+20
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	I was seeing cases where I was losing interrupts when inserting and removing SD cards. Sometimes the card would get "stuck" in the inserted state. I believe that the problem was related to the code to handle the case where we needed both rising and falling edges. This code would disable the interrupt as the polarity was switched. If an interrupt came at the wrong time it could be lost. We'll match what the gpio-dwapb.c driver does upstream and change the interrupt polarity without disabling things. Signed-off-by: Doug Anderson <dianders@chromium.org> Reviewed-by: Heiko Stuebner <heiko@sntech.de> Tested-by: Heiko Stuebner <heiko@sntech.de> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
* \| \|	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	Linus Torvalds	2015-01-20	31	-192/+343
\|\ \ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Pull networking fixes from David Miller: 1) Socket addresses returned in the error queue need to be fully initialized before being passed on to userspace, fix from Willem de Bruijn. 2) Interrupt handling fixes to davinci_emac driver from Tony Lindgren. 3) Fix races between receive packet steering and cpu hotplug, from Eric Dumazet. 4) Allowing netlink sockets to subscribe to unknown multicast groups leads to crashes, don't allow it. From Johannes Berg. 5) One to many socket races in SCTP fixed by Daniel Borkmann. 6) Put in a guard against the mis-use of ipv6 atomic fragments, from Hagen Paul Pfeifer. 7) Fix promisc mode and ethtool crashes in sh_eth driver, from Ben Hutchings. 8) NULL deref and double kfree fix in sxgbe driver from Girish K.S and Byungho An. 9) cfg80211 deadlock fix from Arik Nemtsov. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (36 commits) s2io: use snprintf() as a safety feature r8152: remove sram_read r8152: remove generic_ocp_read before writing bgmac: activate irqs only if there is nothing to poll bgmac: register napi before the device sh_eth: Fix ethtool operation crash when net device is down sh_eth: Fix promiscuous mode on chips without TSU ipv6: stop sending PTB packets for MTU < 1280 net: sctp: fix race for one-to-many sockets in sendmsg's auto associate genetlink: synchronize socket closing and family removal genetlink: disallow subscribing to unknown mcast groups genetlink: document parallel_ops net: rps: fix cpu unplug net: davinci_emac: Add support for emac on dm816x net: davinci_emac: Fix ioremap for devices with MDIO within the EMAC address space net: davinci_emac: Fix incomplete code for getting the phy from device tree net: davinci_emac: Free clock after checking the frequency net: davinci_emac: Fix runtime pm calls for davinci_emac net: davinci_emac: Fix hangs with interrupts ip: zero sockaddr returned on error queue ...
\| * \| \|	s2io: use snprintf() as a safety feature	Dan Carpenter	2015-01-20	1	-3/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	"sp->desc[i]" has 25 characters. "dev->name" has 15 characters. If we used all 15 characters then the sprintf() would overflow. I changed the "sprintf(sp->name, "%s Neterion %s"" to snprintf(), as well, even though it can't overflow just to be consistent. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
\| * \| \|	Merge branch 'r8152'	David S. Miller	2015-01-19	1	-24/+6
\| \|\ \ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Hayes Wang says: ==================== r8152: couldn't read OCP_SRAM_DATA Read OCP_SRAM_DATA would read additional bytes and may let the hw abnormal. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
\| \| * \| \|	r8152: remove sram_read	hayeswang	2015-01-19	1	-18/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Read OCP register 0xa43a~0xa43b would clear some flags which the hw would use, and it may let the device lost. However, the unit of reading is 4 bytes. That is, it would read 0xa438~0xa43b when calling sram_read() to read OCP_SRAM_DATA. Signed-off-by: Hayes Wang <hayeswang@realtek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
\| \| * \| \|	r8152: remove generic_ocp_read before writing	hayeswang	2015-01-19	1	-6/+0
\| \|/ / / \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	For ocp_write_word() and ocp_write_byte(), there is a generic_ocp_read() which is used to read the whole 4 byte data, keep the unchanged bytes, and modify the expected bytes. However, the "byen" could be used to determine which bytes of the 4 bytes to write, so the action could be removed. Signed-off-by: Hayes Wang <hayeswang@realtek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
\| * \| \|	Merge branch 'bgmac'	David S. Miller	2015-01-19	1	-6/+6
\| \|\ \ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Hauke Mehrtens says: ==================== bgmac: some fixes to napi usage I compared the napi documentation with the bgmac driver and found some problems in that driver. These two patches should fix the problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
\| \| * \| \|	bgmac: activate irqs only if there is nothing to poll	Hauke Mehrtens	2015-01-19	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	IRQs should only get activated when there is nothing to poll in the queue any more and to after every poll. Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de> Signed-off-by: David S. Miller <davem@davemloft.net>
\| \| * \| \|	bgmac: register napi before the device	Hauke Mehrtens	2015-01-19	1	-3/+3
\| \|/ / / \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	napi should get registered before the netdev and not after. Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de> Signed-off-by: David S. Miller <davem@davemloft.net>
\| * \| \|	Merge branch 'sh_eth'	David S. Miller	2015-01-19	1	-9/+19
\| \|\ \ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Ben Hutchings says: ==================== sh_eth fixes I'm currently looking at Ethernet support on the R-Car H2 chip, reviewing and testing the sh_eth driver. Here are fixes for two fairly obvious bugs in the driver; I will probably have some more later. These are not tested on any of the other supported chips. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
\| \| * \| \|	sh_eth: Fix ethtool operation crash when net device is down	Ben Hutchings	2015-01-19	1	-0/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The driver connects and disconnects the PHY device whenever the net device is brought up and down. The ethtool get_settings, set_settings and nway_reset operations will dereference a null or dangling pointer if called while it is down. I think it would be preferable to keep the PHY connected, but there may be good reasons not to. As an immediate fix for this bug: - Set the phydev pointer to NULL after disconnecting the PHY - Change those three operations to return -ENODEV while the PHY is not connected Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
\| \| * \| \|	sh_eth: Fix promiscuous mode on chips without TSU	Ben Hutchings	2015-01-19	1	-9/+9
\| \|/ / / \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently net_device_ops::set_rx_mode is only implemented for chips with a TSU (multiple address table). However we do need to turn the PRM (promiscuous) flag on and off for other chips. - Remove the unlikely() from the TSU functions that we may safely call for chips without a TSU - Make setting of the MCT flag conditional on the tsu capability flag - Rename sh_eth_set_multicast_list() to sh_eth_set_rx_mode() and plumb it into both net_device_ops structures - Remove the previously-unreachable branch in sh_eth_rx_mode() that would otherwise reset the flags to defaults for non-TSU chips Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
\| * \| \|	ipv6: stop sending PTB packets for MTU < 1280	Hagen Paul Pfeifer	2015-01-19	1	-5/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Reduce the attack vector and stop generating IPv6 Fragment Header for paths with an MTU smaller than the minimum required IPv6 MTU size (1280 byte) - called atomic fragments. See IETF I-D "Deprecating the Generation of IPv6 Atomic Fragments" [1] for more information and how this "feature" can be misused. [1] https://tools.ietf.org/html/draft-ietf-6man-deprecate-atomfrag-generation-00 Signed-off-by: Fernando Gont <fgont@si6networks.com> Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
\| * \| \|	net: sctp: fix race for one-to-many sockets in sendmsg's auto associate	Daniel Borkmann	2015-01-18	1	-1/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	I.e. one-to-many sockets in SCTP are not required to explicitly call into connect(2) or sctp_connectx(2) prior to data exchange. Instead, they can directly invoke sendmsg(2) and the SCTP stack will automatically trigger connection establishment through 4WHS via sctp_primitive_ASSOCIATE(). However, this in its current implementation is racy: INIT is being sent out immediately (as it cannot be bundled anyway) and the rest of the DATA chunks are queued up for later xmit when connection is established, meaning sendmsg(2) will return successfully. This behaviour can result in an undesired side-effect that the kernel made the application think the data has already been transmitted, although none of it has actually left the machine, worst case even after close(2)'ing the socket. Instead, when the association from client side has been shut down e.g. first gracefully through SCTP_EOF and then close(2), the client could afterwards still receive the server's INIT_ACK due to a connection with higher latency. This INIT_ACK is then considered out of the blue and hence responded with ABORT as there was no alive assoc found anymore. This can be easily reproduced f.e. with sctp_test application from lksctp. One way to fix this race is to wait for the handshake to actually complete. The fix defers waiting after sctp_primitive_ASSOCIATE() and sctp_primitive_SEND() succeeded, so that DATA chunks cooked up from sctp_sendmsg() have already been placed into the output queue through the side-effect interpreter, and therefore can then be bundeled together with COOKIE_ECHO control chunks. strace from example application (shortened): socket(PF_INET, SOCK_SEQPACKET, IPPROTO_SCTP) = 3 sendmsg(3, {msg_name(28)={sa_family=AF_INET, sin_port=htons(8888), sin_addr=inet_addr("192.168.1.115")}, msg_iov(1)=[{"hello", 5}], msg_controllen=0, msg_flags=0}, 0) = 5 sendmsg(3, {msg_name(28)={sa_family=AF_INET, sin_port=htons(8888), sin_addr=inet_addr("192.168.1.115")}, msg_iov(1)=[{"hello", 5}], msg_controllen=0, msg_flags=0}, 0) = 5 sendmsg(3, {msg_name(28)={sa_family=AF_INET, sin_port=htons(8888), sin_addr=inet_addr("192.168.1.115")}, msg_iov(1)=[{"hello", 5}], msg_controllen=0, msg_flags=0}, 0) = 5 sendmsg(3, {msg_name(28)={sa_family=AF_INET, sin_port=htons(8888), sin_addr=inet_addr("192.168.1.115")}, msg_iov(1)=[{"hello", 5}], msg_controllen=0, msg_flags=0}, 0) = 5 sendmsg(3, {msg_name(28)={sa_family=AF_INET, sin_port=htons(8888), sin_addr=inet_addr("192.168.1.115")}, msg_iov(0)=[], msg_controllen=48, {cmsg_len=48, cmsg_level=0x84 /* SOL_??? */, cmsg_type=, ...}, msg_flags=0}, 0) = 0 // graceful shutdown for SOCK_SEQPACKET via SCTP_EOF close(3) = 0 tcpdump before patch (fooling the application): 22:33:36.306142 IP 192.168.1.114.41462 > 192.168.1.115.8888: sctp (1) [INIT] [init tag: 3879023686] [rwnd: 106496] [OS: 10] [MIS: 65535] [init TSN: 3139201684] 22:33:36.316619 IP 192.168.1.115.8888 > 192.168.1.114.41462: sctp (1) [INIT ACK] [init tag: 3345394793] [rwnd: 106496] [OS: 10] [MIS: 10] [init TSN: 3380109591] 22:33:36.317600 IP 192.168.1.114.41462 > 192.168.1.115.8888: sctp (1) [ABORT] tcpdump after patch: 14:28:58.884116 IP 192.168.1.114.35846 > 192.168.1.115.8888: sctp (1) [INIT] [init tag: 438593213] [rwnd: 106496] [OS: 10] [MIS: 65535] [init TSN: 3092969729] 14:28:58.888414 IP 192.168.1.115.8888 > 192.168.1.114.35846: sctp (1) [INIT ACK] [init tag: 381429855] [rwnd: 106496] [OS: 10] [MIS: 10] [init TSN: 2141904492] 14:28:58.888638 IP 192.168.1.114.35846 > 192.168.1.115.8888: sctp (1) [COOKIE ECHO] , (2) [DATA] (B)(E) [TSN: 3092969729] [...] 14:28:58.893278 IP 192.168.1.115.8888 > 192.168.1.114.35846: sctp (1) [COOKIE ACK] , (2) [SACK] [cum ack 3092969729] [a_rwnd 106491] [#gap acks 0] [#dup tsns 0] 14:28:58.893591 IP 192.168.1.114.35846 > 192.168.1.115.8888: sctp (1) [DATA] (B)(E) [TSN: 3092969730] [...] 14:28:59.096963 IP 192.168.1.115.8888 > 192.168.1.114.35846: sctp (1) [SACK] [cum ack 3092969730] [a_rwnd 106496] [#gap acks 0] [#dup tsns 0] 14:28:59.097086 IP 192.168.1.114.35846 > 192.168.1.115.8888: sctp (1) [DATA] (B)(E) [TSN: 3092969731] [...] , (2) [DATA] (B)(E) [TSN: 3092969732] [...] 14:28:59.103218 IP 192.168.1.115.8888 > 192.168.1.114.35846: sctp (1) [SACK] [cum ack 3092969732] [a_rwnd 106486] [#gap acks 0] [#dup tsns 0] 14:28:59.103330 IP 192.168.1.114.35846 > 192.168.1.115.8888: sctp (1) [SHUTDOWN] 14:28:59.107793 IP 192.168.1.115.8888 > 192.168.1.114.35846: sctp (1) [SHUTDOWN ACK] 14:28:59.107890 IP 192.168.1.114.35846 > 192.168.1.115.8888: sctp (1) [SHUTDOWN COMPLETE] Looks like this bug is from the pre-git history museum. ;) Fixes: 08707d5482df ("lksctp-2_5_31-0_5_1.patch") Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>