summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* md-cluster: check the return value of process_recvd_msgGuoqing Jiang2016-05-091-3/+10
| | | | | | | | | We don't need to run the full path of recv_daemon if process_recvd_msg doesn't return 0. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md-cluster: gather resync infos and enable recv_thread after bitmap is readyGuoqing Jiang2016-05-093-6/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | The in-memory bitmap is not ready when node joins cluster, so it doesn't make sense to make gather_all_resync_info() called so earlier, we need to call it after the node's bitmap is setup. Also, recv_thread could be wake up after node joins cluster, but it could cause problem if node receives RESYNCING message without persionality since mddev->pers->quiesce is called in process_suspend_info. This commit introduces a new cluster interface load_bitmaps to fix above problems, load_bitmaps is called in bitmap_load where bitmap and persionality are ready, and load_bitmaps does the following tasks: 1. call gather_all_resync_info to load all the node's bitmap info. 2. set MD_CLUSTER_ALREADY_IN_CLUSTER bit to recv_thread could be wake up, and wake up recv_thread if there is pending recv event. Then ack_bast only wakes up recv_thread after IN_CLUSTER bit is ready otherwise MD_CLUSTER_PENDING_RESYNC_EVENT is set. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md: set MD_CHANGE_PENDING in a atomic regionGuoqing Jiang2016-05-096-23/+40
| | | | | | | | | | | | | | | | | | | | | | | | | Some code waits for a metadata update by: 1. flagging that it is needed (MD_CHANGE_DEVS or MD_CHANGE_CLEAN) 2. setting MD_CHANGE_PENDING and waking the management thread 3. waiting for MD_CHANGE_PENDING to be cleared If the first two are done without locking, the code in md_update_sb() which checks if it needs to repeat might test if an update is needed before step 1, then clear MD_CHANGE_PENDING after step 2, resulting in the wait returning early. So make sure all places that set MD_CHANGE_PENDING are atomicial, and bit_clear_unless (suggested by Neil) is introduced for the purpose. Cc: Martin Kepplinger <martink@posteo.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: <linux-kernel@vger.kernel.org> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md: raid5: add prerequisite to run underneath dm-raidHeinz Mauelshagen2016-05-091-2/+4
| | | | | | | | | | In case md runs underneath the dm-raid target, the mddev does not have a request queue or gendisk, thus avoid accesses. This patch adds a missing conditional to the raid5 personality. Signed-of-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md: raid10: add prerequisite to run underneath dm-raidHeinz Mauelshagen2016-05-091-4/+8
| | | | | | | | | | In case md runs underneath the dm-raid target, the mddev does not have a request queue or gendisk, thus avoid accesses to it. This patch adds two missing conditionals to the raid10 personality. Signed-of-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md: md.c: fix oops in mddev_suspend for raid0Heinz Mauelshagen2016-05-091-1/+1
| | | | | | | | | | | | | | | | Introduced by upstream commit 70d9798b95562abac005d4ba71d28820f9a201eb The raid0 personality does not create mddev->thread as oposed to other personalities leading to its unconditional access in mddev_suspend() causing an oops. Patch checks for mddev->thread in order to keep the intention of aforementioned commit. Fixes: 70d9798b9556 ("MD: warn for potential deadlock") Cc: stable@vger.kernel.org (4.5+) Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md-cluster: fix ifnullfree.cocci warningskbuild test robot2016-05-041-2/+1
| | | | | | | | | | | | | | | | drivers/md/bitmap.c:2049:6-11: WARNING: NULL check before freeing functions like kfree, debugfs_remove, debugfs_remove_recursive or usb_free_urb is not needed. Maybe consider reorganizing relevant code to avoid passing NULL values. NULL check before some freeing functions is not needed. Based on checkpatch warning "kfree(NULL) is safe this check is probably not required" and kfreeaddr.cocci by Julia Lawall. Generated by: scripts/coccinelle/free/ifnullfree.cocci Acked-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md-cluster/bitmap: unplug bitmap to sync dirty pages to diskGuoqing Jiang2016-05-041-5/+5
| | | | | | | | | | | | | | | | | | This patch is doing two distinct but related things. 1. It adds bitmap_unplug() for the main bitmap (mddev->bitmap). As bit have been set, BITMAP_PAGE_DIRTY is set so bitmap_deamon_work() will not write those pages out in its regular scans, only bitmap_unplug() will. If there are no writes to the array, bitmap_unplug() won't be called, so we need to call it explicitly here. 2. bitmap_write_all() is a bit of a confusing interface as it doesn't actually write anything. The current code for writing "bitmap" works but this change makes it a bit clearer. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md-cluster/bitmap: fix wrong page num in bitmap_file_clear_bit and ↵Guoqing Jiang2016-05-041-3/+13
| | | | | | | | | | | | | | | | | bitmap_file_set_bit The pnum passed to set_page_attr and test_page_attr should from 0 to storage.file_pages - 1, but bitmap_file_set_bit and bitmap_file_clear_bit call set_page_attr and test_page_attr with page->index parameter while page->index has already added node_offset before. So we need to minus node_offset in both bitmap_file_clear_bit and bitmap_file_set_bit. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md-cluster/bitmap: fix wrong calcuation of offsetGuoqing Jiang2016-05-041-1/+1
| | | | | | | | | | | | | | | The offset is wrong in bitmap_storage_alloc, we should set it like below in bitmap_init_from_disk(). node_offset = bitmap->cluster_slot * (DIV_ROUND_UP(store->bytes, PAGE_SIZE)); Because 'offset' is only assigned to 'page->index' and that is usually over-written by read_sb_page. So it does not cause problem in general, but it still need to be fixed. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md-cluster: sync bitmap when node received RESYNCING msgGuoqing Jiang2016-05-043-0/+51
| | | | | | | | | | | | | | | | If the node received RESYNCING message which means another node will perform resync with the area, then we don't want to do it again in another node. Let's set RESYNC_MASK and clear NEEDED_MASK for the region from old-low to new-low which has finished syncing, and the region from old-hi to new-hi is about to syncing, bitmap_sync_with_cluste is introduced for the purpose. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md-cluster: always setup in-memory bitmapGuoqing Jiang2016-05-041-2/+35
| | | | | | | | | | | | | | | | | | | | | | | The in-memory bitmap for raid is allocated on demand, then for cluster scenario, it is possible that slave node which received RESYNCING message doesn't have the in-memory bitmap when master node is perform resyncing, so we can't make bitmap is match up well among each nodes. So for cluster scenario, we need always preserve the bitmap, and ensure the page will not be freed. And a no_hijack flag is introduced to both bitmap_checkpage and bitmap_get_counter, which makes cluster raid returns fail once allocate failed. And the next patch is relied on this change since it keeps sync bitmap among each nodes during resyncing stage. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md-cluster: wakeup thread if activated a spare diskGuoqing Jiang2016-05-041-0/+5
| | | | | | | | | | | | | | | | | When a device is re-added, it will ultimately need to be activated and that happens in md_check_recovery, so we need to set MD_RECOVERY_NEEDED right after remove_and_add_spares. A specifical issue without the change is that when one node perform fail/remove/readd on a disk, but slave nodes could not add the disk back to array as expected (added as missed instead of in sync). So give slave nodes a chance to do resync. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md-cluster: change array_sectors and update size are not supportedGuoqing Jiang2016-05-042-0/+14
| | | | | | | | | | Currently, some features are not supported yet, such as change array_sectors and update size, so return EINVAL for them and listed it in document. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md-cluster: fix locking when node joins cluster during message broadcastGuoqing Jiang2016-05-041-3/+10
| | | | | | | | | | | | | | | | | | | | If a node joins the cluster while a message broadcast is under way, a lock issue could happen as follows. For a cluster which included two nodes, if node A is calling __sendmsg before up-convert CR to EX on ack, and node B released CR on ack. But if a new node C joins the cluster and it doesn't receive the message which A sent before, so it could hold CR on ack before A up-convert CR to EX on ack. So a node joining the cluster should get an EX lock on the "token" first to ensure no broadcast is ongoing, then release it after held CR on ack. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md-cluster: unregister thread if err happenedGuoqing Jiang2016-05-041-0/+2
| | | | | | | | | The two threads need to be unregistered if a node can't join cluster successfully. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md-cluster: wake up thread to continue recoveryGuoqing Jiang2016-05-041-3/+6
| | | | | | | | | In recovery case, we need to set MD_RECOVERY_NEEDED and wake up thread only if recover is not finished. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md-cluser: make resync_finish only called after pers->sync_requestGuoqing Jiang2016-05-041-12/+13
| | | | | | | | | | | | | | | | | | It is not reasonable that cluster raid to release resync lock before the last pers->sync_request has finished. As the metadata will be changed when node performs resync, we need to inform other nodes to update metadata, so the MD_CHANGE_PENDING flag is set before finish resync. Then metadata_update_finish is move ahead to ensure that METADATA_UPDATED msg is sent before finish resync, and metadata_update_start need to be run after "repeat:" label accordingly. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md-cluster: change resync lock from asynchronous to synchronousGuoqing Jiang2016-05-042-11/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | If multiple nodes choose to attempt do resync at the same time they need to be serialized so they don't duplicate effort. This serialization is done by locking the 'resync' DLM lock. Currently if a node cannot get the lock immediately it doesn't request notification when the lock becomes available (i.e. DLM_LKF_NOQUEUE is set), so it may not reliably find out when it is safe to try again. Rather than trying to arrange an async wake-up when the lock becomes available, switch to using synchronous locking - this is a lot easier to think about. As it is not permitted to block in the 'raid1d' thread, move the locking to the resync thread. So the rsync thread is forked immediately, but it blocks until the resync lock is available. Once the lock is locked it checks again if any resync action is needed. A particular symptom of the current problem is that a node can get stuck with "resync=pending" indefinitely. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* Merge branch 'for-linus' of ↵Linus Torvalds2016-05-041-2/+2
|\ | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull IMA fix from James Morris. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: ima: fix the string representation of the LSM/IMA hook enumeration ordering
| * ima: fix the string representation of the LSM/IMA hook enumeration orderingMimi Zohar2016-05-041-2/+2
| | | | | | | | | | | | | | | | | | | | This patch fixes the string representation of the LSM/IMA hook enumeration ordering used for displaying the IMA policy. Fixes: d9ddf077bb85 ("ima: support for kexec image and initramfs") Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com> Tested-by: Eric Richter <erichte@linux.vnet.ibm.com> Signed-off-by: James Morris <james.l.morris@oracle.com>
* | Merge tag 'for-linus-4.6-rc6-tag' of ↵Linus Torvalds2016-05-043-14/+26
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen regression fixes from David Vrabel: - Fix two regressions causing crashes in 32-bit PV guests - Fix a regression in the evtchn driver * tag 'for-linus-4.6-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: xen/evtchn: fix ring resize when binding new events xen/balloon: Fix crash when ballooning on x86 32 bit PAE xen: Fix page <-> pfn conversion on 32 bit systems
| * xen/evtchn: fix ring resize when binding new eventsJan Beulich2016-05-041-12/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The copying of ring data was wrong for two cases: For a full ring nothing got copied at all (as in that case the canonicalized producer and consumer indexes are identical). And in case one or both of the canonicalized (after the resize) indexes would point into the second half of the buffer, the copied data ended up in the wrong (free) part of the new buffer. In both cases uninitialized data would get passed back to the caller. Fix this by simply copying the old ring contents twice: Once to the low half of the new buffer, and a second time to the high half. This addresses the inability to boot a HVM guest with 64 or more vCPUs. This regression was caused by 8620015499101090 (xen/evtchn: dynamically grow pending event channel ring). Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Cc: <stable@vger.kernel.org> # 4.4+ Signed-off-by: David Vrabel <david.vrabel@citrix.com>
| * xen/balloon: Fix crash when ballooning on x86 32 bit PAERoss Lagerwall2016-04-061-0/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 55b3da98a40dbb3776f7454daf0d95dde25c33d2 (xen/balloon: find non-conflicting regions to place hotplugged memory) caused a regression in 4.4. When ballooning on an x86 32 bit PAE system with close to 64 GiB of memory, the address returned by allocate_resource may be above 64 GiB. When using CONFIG_SPARSEMEM, this setup is limited to using physical addresses < 64 GiB. When adding memory at this address, it runs off the end of the mem_section array and causes a crash. Instead, fail the ballooning request. Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Cc: <stable@vger.kernel.org> # 4.4+ Signed-off-by: David Vrabel <david.vrabel@citrix.com>
| * xen: Fix page <-> pfn conversion on 32 bit systemsRoss Lagerwall2016-04-061-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 1084b1988d22dc165c9dbbc2b0e057f9248ac4db (xen: Add Xen specific page definition) caused a regression in 4.4. The xen functions to convert between pages and pfns fail due to an overflow on systems where a physical address may not fit in an unsigned long (e.g. x86 32 bit PAE systems). Rework the conversion to avoid overflow. This should also result in simpler object code. This bug manifested itself as disk corruption with Linux 4.4 when using blkfront in a Xen HVM x86 32 bit guest with more than 4 GiB of memory. Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Cc: <stable@vger.kernel.org> # 4.4+ Signed-off-by: David Vrabel <david.vrabel@citrix.com>
* | Merge tag 'trace-fixes-v4.6-rc6' of ↵Linus Torvalds2016-05-041-2/+7
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fix from Steven Rostedt: "Chunyu Hu noticed that if one writes into the trigger files within the ftrace subsystem of events that it can cause an oops. This file is only writable by root, but still is a bug that needs to be fixed" * tag 'trace-fixes-v4.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Don't display trigger file for events that can't be enabled
| * | tracing: Don't display trigger file for events that can't be enabledChunyu Hu2016-05-031-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently register functions for events will be called through the 'reg' field of event class directly without any check when seting up triggers. Triggers for events that don't support register through debug fs (events under events/ftrace are for trace-cmd to read event format, and most of them don't have a register function except events/ftrace/functionx) can't be enabled at all, and an oops will be hit when setting up trigger for those events, so just not creating them is an easy way to avoid the oops. Link: http://lkml.kernel.org/r/1462275274-3911-1-git-send-email-chuhu@redhat.com Cc: stable@vger.kernel.org # 3.14+ Fixes: 85f2b08268c01 ("tracing: Add basic event trigger framework") Signed-off-by: Chunyu Hu <chuhu@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2016-05-0424-127/+259
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking fixes from David Miller: "Some straggler bug fixes: 1) Batman-adv DAT must consider VLAN IDs when choosing candidate nodes, from Antonio Quartulli. 2) Fix botched reference counting of vlan objects and neigh nodes in batman-adv, from Sven Eckelmann. 3) netem can crash when it sees GSO packets, the fix is to segment then upon ->enqueue. Fix from Neil Horman with help from Eric Dumazet. 4) Fix VXLAN dependencies in mlx5 driver Kconfig, from Matthew Finlay. 5) Handle VXLAN ops outside of rcu lock, via a workqueue, in mlx5, since it can sleep. Fix also from Matthew Finlay. 6) Check mdiobus_scan() return values properly in pxa168_eth and macb drivers. From Sergei Shtylyov. 7) If the netdevice doesn't support checksumming, disable segmentation. From Alexandery Duyck. 8) Fix races between RDS tcp accept and sending, from Sowmini Varadhan. 9) In macb driver, probe MDIO bus before we register the netdev, otherwise we can try to open the device before it is really ready for that. Fix from Florian Fainelli. 10) Netlink attribute size for ILA "tunnels" not calculated properly, fix from Nicolas Dichtel" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: ipv6/ila: fix nlsize calculation for lwtunnel net: macb: Probe MDIO bus before registering netdev RDS: TCP: Synchronize accept() and connect() paths on t_conn_lock. RDS:TCP: Synchronize rds_tcp_accept_one with rds_send_xmit when resetting t_sock vxlan: Add checksum check to the features check function net: Disable segmentation if checksumming is not supported net: mvneta: Remove superfluous SMP function call macb: fix mdiobus_scan() error check pxa168_eth: fix mdiobus_scan() error check net/mlx5e: Use workqueue for vxlan ops net/mlx5e: Implement a mlx5e workqueue net/mlx5: Kconfig: Fix MLX5_EN/VXLAN build issue net/mlx5: Unmap only the relevant IO memory mapping netem: Segment GSO packets on enqueue batman-adv: Fix reference counting of hardif_neigh_node object for neigh_node batman-adv: Fix reference counting of vlan object for tt_local_entry batman-adv: B.A.T.M.A.N V - make sure iface is reactivated upon NETDEV_UP event batman-adv: fix DAT candidate selection (must use vid)
| * | | ipv6/ila: fix nlsize calculation for lwtunnelNicolas Dichtel2016-05-031-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The handler 'ila_fill_encap_info' adds one attribute: ILA_ATTR_LOCATOR. Fixes: 65d7ab8de582 ("net: Identifier Locator Addressing module") CC: Tom Herbert <tom@herbertland.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | net: macb: Probe MDIO bus before registering netdevFlorian Fainelli2016-05-031-12/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current sequence makes us register for a network device prior to registering and probing the MDIO bus which could lead to some unwanted consequences, like a thread of execution calling into ndo_open before register_netdev() returns, while the MDIO bus is not ready yet. Rework the sequence to register for the MDIO bus, and therefore attach to a PHY prior to calling register_netdev(), which implies reworking the error path a bit. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | Merge branch 'rds-fixes'David S. Miller2016-05-034-19/+50
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Sowmini Varadhan says: ==================== RDS: TCP: sychronization during connection startup This patch series ensures that the passive (accept) side of the TCP connection used for RDS-TCP is correctly synchronized with any concurrent active (connect) attempts for a given pair of peers. Patch 1 in the series makes sure that the t_sock in struct rds_tcp_connection is only reset after any threads in rds_tcp_xmit have completed (otherwise a null-ptr deref may be encountered). Patch 2 synchronizes rds_tcp_accept_one() with the rds_tcp*connect() path. v2: review comments from Santosh Shilimkar, other spelling corrections ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | RDS: TCP: Synchronize accept() and connect() paths on t_conn_lock.Sowmini Varadhan2016-05-034-10/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | An arbitration scheme for duelling SYNs is implemented as part of commit 241b271952eb ("RDS-TCP: Reset tcp callbacks if re-using an outgoing socket in rds_tcp_accept_one()") which ensures that both nodes involved will arrive at the same arbitration decision. However, this needs to be synchronized with an outgoing SYN to be generated by rds_tcp_conn_connect(). This commit achieves the synchronization through the t_conn_lock mutex in struct rds_tcp_connection. The rds_conn_state is checked in rds_tcp_conn_connect() after acquiring the t_conn_lock mutex. A SYN is sent out only if the RDS connection is not already UP (an UP would indicate that rds_tcp_accept_one() has completed 3WH, so no SYN needs to be generated). Similarly, the rds_conn_state is checked in rds_tcp_accept_one() after acquiring the t_conn_lock mutex. The only acceptable states (to allow continuation of the arbitration logic) are UP (i.e., outgoing SYN was SYN-ACKed by peer after it sent us the SYN) or CONNECTING (we sent outgoing SYN before we saw incoming SYN). Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | RDS:TCP: Synchronize rds_tcp_accept_one with rds_send_xmit when resetting t_sockSowmini Varadhan2016-05-032-17/+25
| |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is a race condition between rds_send_xmit -> rds_tcp_xmit and the code that deals with resolution of duelling syns added by commit 241b271952eb ("RDS-TCP: Reset tcp callbacks if re-using an outgoing socket in rds_tcp_accept_one()"). Specifically, we may end up derefencing a null pointer in rds_send_xmit if we have the interleaving sequence: rds_tcp_accept_one rds_send_xmit conn is RDS_CONN_UP, so invoke rds_tcp_xmit tc = conn->c_transport_data rds_tcp_restore_callbacks /* reset t_sock */ null ptr deref from tc->t_sock The race condition can be avoided without adding the overhead of additional locking in the xmit path: have rds_tcp_accept_one wait for rds_tcp_xmit threads to complete before resetting callbacks. The synchronization can be done in the same manner as rds_conn_shutdown(). First set the rds_conn_state to something other than RDS_CONN_UP (so that new threads cannot get into rds_tcp_xmit()), then wait for RDS_IN_XMIT to be cleared in the conn->c_flags indicating that any threads in rds_tcp_xmit are done. Fixes: 241b271952eb ("RDS-TCP: Reset tcp callbacks if re-using an outgoing socket in rds_tcp_accept_one()") Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | Merge branch 'tunnel-csum-and-sg-offloads'David S. Miller2016-05-033-2/+9
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Alexander Duyck says: ==================== Fixes for tunnel checksum and segmentation offloads This patch series is a subset of patches I had submitted for net-next. I plan to drop these two patches from the v3 of "Fix Tunnel features and enable GSO partial for several drivers" and I am instead submitting them for net since these are truly fixes and likely will need to be backported to stable branches. This series addresses 2 specific issues. The first is that we could request TSO on a v4 inner header while not supporting checksum offload of the outer IPv6 header. The second is that we could request an IPv6 inner checksum offload without validating that we could actually support an inner IPv6 checksum offload. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | vxlan: Add checksum check to the features check functionAlexander Duyck2016-05-032-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We need to perform an additional check on the inner headers to determine if we can offload the checksum for them. Previously this check didn't occur so we would generate an invalid frame in the case of an IPv6 header encapsulated inside of an IPv4 tunnel. To fix this I added a secondary check to vxlan_features_check so that we can verify that we can offload the inner checksum. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | net: Disable segmentation if checksumming is not supportedAlexander Duyck2016-05-031-1/+1
| |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the case of the mlx4 and mlx5 driver they do not support IPv6 checksum offload for tunnels. With this being the case we should disable GSO in addition to the checksum offload features when we find that a device cannot perform a checksum on a given packet type. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | net: mvneta: Remove superfluous SMP function callAnna-Maria Gleixner2016-05-031-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since commit 3b9d6da67e11 ("cpu/hotplug: Fix rollback during error-out in __cpu_disable()") it is ensured that callbacks of CPU_ONLINE and CPU_DOWN_PREPARE are processed on the hotplugged CPU. Due to this SMP function calls are no longer required. Replace smp_call_function_single() with a direct call to mvneta_percpu_enable() or mvneta_percpu_disable(). The functions do not require to be called with interrupts disabled, therefore the smp_call_function_single() calling convention is not preserved. Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Cc: netdev@vger.kernel.org Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | macb: fix mdiobus_scan() error checkSergei Shtylyov2016-05-031-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now mdiobus_scan() returns ERR_PTR(-ENODEV) instead of NULL if the PHY device ID was read as all ones. As this was not an error before, this value should be filtered out now in this driver. Fixes: b74766a0a0fe ("phylib: don't return NULL from get_phy_device()") Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | pxa168_eth: fix mdiobus_scan() error checkSergei Shtylyov2016-05-031-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since mdiobus_scan() returns either an error code or NULL on error, the driver should check for both, not only for NULL, otherwise a crash is imminent... Reported-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | Merge branch 'mlx5-fixes'David S. Miller2016-05-036-29/+74
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Saeed Mahameed says: ==================== Mellanox 100G mlx5 fixes for 4.6-rc This small series provides some bug fixes for mlx5 driver. A small bug fix for iounmap of a null pointer, which dumps a warning on some archs. One patch to fix the VXLAN/MLX5_EN dependency issue reported by Arnd. Two patches to fix the scheduling while atomic issue for ndo_add/del_vxlan_port NDOs. The first will add an internal mlx5e workqueue and the second will delegate vxlan ports add/del requests to that workqueue. Note: ('net/mlx5: Kconfig: Fix MLX5_EN/VXLAN build issue') is only needed for net and not net-next as the issue was globally fixed for all device drivers by: b7aade15485a ('vxlan: break dependency with netdev drivers') in net-next. Applied on top: f27337e16f2d ('ip_tunnel: fix preempt warning in ip tunnel creation/updating') ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | net/mlx5e: Use workqueue for vxlan opsMatthew Finlay2016-05-033-16/+49
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The vxlan add/delete port NDOs are called under rcu lock. The current mlx5e implementation can potentially block in these calls, which is not allowed. Move to using the mlx5e workqueue to handle these NDOs. Fixes: b3f63c3d5e2c ('net/mlx5e: Add netdev support for VXLAN tunneling') Signed-off-by: Matthew Finlay <matt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | net/mlx5e: Implement a mlx5e workqueueMatthew Finlay2016-05-032-11/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement a mlx5e workqueue to handle all mlx5e specific tasks. Move all tasks currently using the system workqueue to the new workqueue. This is in preparation for vxlan using the mlx5e workqueue in order to schedule port add/remove operations. Signed-off-by: Matthew Finlay <matt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | net/mlx5: Kconfig: Fix MLX5_EN/VXLAN build issueMatthew Finlay2016-05-031-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When MLX5_EN=y MLX5_CORE=y and VXLAN=m there is a linker error for vxlan_get_rx_port() due to the fact that VXLAN is a module. Change Kconfig to select VXLAN when MLX5_CORE=y. When MLX5_CORE=m there is no dependency on the value of VXLAN. Fixes: b3f63c3d5e2c ('net/mlx5e: Add netdev support for VXLAN tunneling') Signed-off-by: Matthew Finlay <matt@mellanox.com> Reported-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | net/mlx5: Unmap only the relevant IO memory mappingGal Pressman2016-05-031-2/+4
| |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When freeing UAR the driver tries to unmap uar->map and uar->bf_map which are mutually exclusive thus always unmapping a NULL pointer. Make sure we only call iounmap() once, for the actual mapping. Fixes: 0ba422410bbf ('net/mlx5: Fix global UAR mapping') Signed-off-by: Gal Pressman <galp@mellanox.com> Reported-by: Doron Tsur <doront@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | netem: Segment GSO packets on enqueueNeil Horman2016-05-031-2/+59
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This was recently reported to me, and reproduced on the latest net kernel, when attempting to run netperf from a host that had a netem qdisc attached to the egress interface: [ 788.073771] ---------------------[ cut here ]--------------------------- [ 788.096716] WARNING: at net/core/dev.c:2253 skb_warn_bad_offload+0xcd/0xda() [ 788.129521] bnx2: caps=(0x00000001801949b3, 0x0000000000000000) len=2962 data_len=0 gso_size=1448 gso_type=1 ip_summed=3 [ 788.182150] Modules linked in: sch_netem kvm_amd kvm crc32_pclmul ipmi_ssif ghash_clmulni_intel sp5100_tco amd64_edac_mod aesni_intel lrw gf128mul glue_helper ablk_helper edac_mce_amd cryptd pcspkr sg edac_core hpilo ipmi_si i2c_piix4 k10temp fam15h_power hpwdt ipmi_msghandler shpchp acpi_power_meter pcc_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper ahci ata_generic pata_acpi ttm libahci crct10dif_pclmul pata_atiixp tg3 libata crct10dif_common drm crc32c_intel ptp serio_raw bnx2 r8169 hpsa pps_core i2c_core mii dm_mirror dm_region_hash dm_log dm_mod [ 788.465294] CPU: 16 PID: 0 Comm: swapper/16 Tainted: G W ------------ 3.10.0-327.el7.x86_64 #1 [ 788.511521] Hardware name: HP ProLiant DL385p Gen8, BIOS A28 12/17/2012 [ 788.542260] ffff880437c036b8 f7afc56532a53db9 ffff880437c03670 ffffffff816351f1 [ 788.576332] ffff880437c036a8 ffffffff8107b200 ffff880633e74200 ffff880231674000 [ 788.611943] 0000000000000001 0000000000000003 0000000000000000 ffff880437c03710 [ 788.647241] Call Trace: [ 788.658817] <IRQ> [<ffffffff816351f1>] dump_stack+0x19/0x1b [ 788.686193] [<ffffffff8107b200>] warn_slowpath_common+0x70/0xb0 [ 788.713803] [<ffffffff8107b29c>] warn_slowpath_fmt+0x5c/0x80 [ 788.741314] [<ffffffff812f92f3>] ? ___ratelimit+0x93/0x100 [ 788.767018] [<ffffffff81637f49>] skb_warn_bad_offload+0xcd/0xda [ 788.796117] [<ffffffff8152950c>] skb_checksum_help+0x17c/0x190 [ 788.823392] [<ffffffffa01463a1>] netem_enqueue+0x741/0x7c0 [sch_netem] [ 788.854487] [<ffffffff8152cb58>] dev_queue_xmit+0x2a8/0x570 [ 788.880870] [<ffffffff8156ae1d>] ip_finish_output+0x53d/0x7d0 ... The problem occurs because netem is not prepared to handle GSO packets (as it uses skb_checksum_help in its enqueue path, which cannot manipulate these frames). The solution I think is to simply segment the skb in a simmilar fashion to the way we do in __dev_queue_xmit (via validate_xmit_skb), with some minor changes. When we decide to corrupt an skb, if the frame is GSO, we segment it, corrupt the first segment, and enqueue the remaining ones. tested successfully by myself on the latest net kernel, to which this applies Signed-off-by: Neil Horman <nhorman@tuxdriver.com> CC: Jamal Hadi Salim <jhs@mojatatu.com> CC: "David S. Miller" <davem@davemloft.net> CC: netem@lists.linux-foundation.org CC: eric.dumazet@gmail.com CC: stephen@networkplumber.org Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | Merge tag 'batman-adv-fix-for-davem' of git://git.open-mesh.org/linux-mergeDavid S. Miller2016-05-036-56/+41
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Antonio Quartulli says: ==================== In this small batch of patches you have: - a fix for our Distributed ARP Table that makes sure that the input provided to the hash function during a query is the same as the one provided during an insert (so to prevent false negatives), by Antonio Quartulli - a fix for our new protocol implementation B.A.T.M.A.N. V that ensures that a hard interface is properly re-activated when it is brought down and then up again, by Antonio Quartulli - two fixes respectively to the reference counting of the tt_local_entry and neigh_node objects, by Sven Eckelmann. Such bug is rather severe as it would prevent the netdev objects references by batman-adv from being released after shutdown. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | batman-adv: Fix reference counting of hardif_neigh_node object for neigh_nodeSven Eckelmann2016-04-292-11/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The batadv_neigh_node was specific to a batadv_hardif_neigh_node and held an implicit reference to it. But this reference was never stored in form of a pointer in the batadv_neigh_node itself. Instead batadv_neigh_node_release depends on a consistent state of hard_iface->neigh_list and that batadv_hardif_neigh_get always returns the batadv_hardif_neigh_node object which it has a reference for. But batadv_hardif_neigh_get cannot guarantee that because it is working only with rcu_read_lock on this list. It can therefore happen that a neigh_addr is in this list twice or that batadv_hardif_neigh_get cannot find the batadv_hardif_neigh_node for an neigh_addr due to some other list operations taking place at the same time. Instead add a batadv_hardif_neigh_node pointer directly in batadv_neigh_node which will be used for the reference counter decremented on release of batadv_neigh_node. Fixes: cef63419f7db ("batman-adv: add list of unique single hop neighbors per hard-interface") Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>
| | * | | batman-adv: Fix reference counting of vlan object for tt_local_entrySven Eckelmann2016-04-292-38/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The batadv_tt_local_entry was specific to a batadv_softif_vlan and held an implicit reference to it. But this reference was never stored in form of a pointer in the tt_local_entry itself. Instead batadv_tt_local_remove, batadv_tt_local_table_free and batadv_tt_local_purge_pending_clients depend on a consistent state of bat_priv->softif_vlan_list and that batadv_softif_vlan_get always returns the batadv_softif_vlan object which it has a reference for. But batadv_softif_vlan_get cannot guarantee that because it is working only with rcu_read_lock on this list. It can therefore happen that an vid is in this list twice or that batadv_softif_vlan_get cannot find the batadv_softif_vlan for an vid due to some other list operations taking place at the same time. Instead add a batadv_softif_vlan pointer directly in batadv_tt_local_entry which will be used for the reference counter decremented on release of batadv_tt_local_entry. Fixes: 35df3b298fc8 ("batman-adv: fix TT VLAN inconsistency on VLAN re-add") Signed-off-by: Sven Eckelmann <sven@narfation.org> Acked-by: Antonio Quartulli <a@unstable.cc> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>
| | * | | batman-adv: B.A.T.M.A.N V - make sure iface is reactivated upon NETDEV_UP eventAntonio Quartulli2016-04-293-0/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | At the moment there is no explicit reactivation of an hard-interface upon NETDEV_UP event. In case of B.A.T.M.A.N. IV the interface is reactivated as soon as the next OGM is scheduled for sending, but this mechanism does not work with B.A.T.M.A.N. V. The latter does not rely on the same scheduling mechanism as its predecessor and for this reason the hard-interface remains deactivated forever after being brought down once. This patch fixes the reactivation mechanism by adding a new routing API which explicitly allows each algorithm to perform any needed operation upon interface re-activation. Such API is optional and is implemented by B.A.T.M.A.N. V only and it just takes care of setting the iface status to ACTIVE Signed-off-by: Antonio Quartulli <a@unstable.cc> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
| | * | | batman-adv: fix DAT candidate selection (must use vid)Antonio Quartulli2016-04-291-7/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that DAT is VLAN aware, it must use the VID when computing the DHT address of the candidate nodes where an entry is going to be stored/retrieved. Fixes: be1db4f6615b ("batman-adv: make the Distributed ARP Table vlan aware") Signed-off-by: Antonio Quartulli <a@unstable.cc> [sven@narfation.org: fix conflicts with current version] Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>