summaryrefslogtreecommitdiffstats
path: root/net (follow)
Commit message (Collapse)AuthorAgeFilesLines
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2016-06-0125-71/+203
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking fixes from David Miller: 1) Fix negative error code usage in ATM layer, from Stefan Hajnoczi. 2) If CONFIG_SYSCTL is disabled, the default TTL is not initialized properly. From Ezequiel Garcia. 3) Missing spinlock init in mvneta driver, from Gregory CLEMENT. 4) Missing unlocks in hwmb error paths, also from Gregory CLEMENT. 5) Fix deadlock on team->lock when propagating features, from Ivan Vecera. 6) Work around buffer offset hw bug in alx chips, from Feng Tang. 7) Fix double listing of SCTP entries in sctp_diag dumps, from Xin Long. 8) Various statistics bug fixes in mlx4 from Eric Dumazet. 9) Fix some randconfig build errors wrt fou ipv6 from Arnd Bergmann. 10) All of l2tp was namespace aware, but the ipv6 support code was not doing so. From Shmulik Ladkani. 11) Handle on-stack hrtimers properly in pktgen, from Guenter Roeck. 12) Propagate MAC changes properly through VLAN devices, from Mike Manning. 13) Fix memory leak in bnx2x_init_one(), from Vitaly Kuznetsov. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (62 commits) sfc: Track RPS flow IDs per channel instead of per function usbnet: smsc95xx: fix link detection for disabled autonegotiation virtio_net: fix virtnet_open and virtnet_probe competing for try_fill_recv bnx2x: avoid leaking memory on bnx2x_init_one() failures fou: fix IPv6 Kconfig options openvswitch: update checksum in {push,pop}_mpls sctp: sctp_diag should dump sctp socket type net: fec: update dirty_tx even if no skb vlan: Propagate MAC address to VLANs atm: iphase: off by one in rx_pkt() atm: firestream: add more reserved strings vxlan: Accept user specified MTU value when create new vxlan link net: pktgen: Call destroy_hrtimer_on_stack() timer: Export destroy_hrtimer_on_stack() net: l2tp: Make l2tp_ip6 namespace aware Documentation: ip-sysctl.txt: clarify secure_redirects sfc: use flow dissector helpers for aRFS ieee802154: fix logic error in ieee802154_llsec_parse_dev_addr net: nps_enet: Disable interrupts before napi reschedule net/lapb: tuse %*ph to dump buffers ...
| * fou: fix IPv6 Kconfig optionsArnd Bergmann2016-05-312-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The Kconfig options I added to work around broken compilation ended up screwing up things more, as I used the wrong symbol to control compilation of the file, resulting in IPv6 fou support to never be built into the kernel. Changing CONFIG_NET_FOU_IPV6_TUNNELS to CONFIG_IPV6_FOU fixes that problem, I had renamed the symbol in one location but not the other, and as the file is never being used by other kernel code, this did not lead to a build failure that I would have caught. After that fix, another issue with the same patch becomes obvious, as we 'select INET6_TUNNEL', which is related to IPV6_TUNNEL, but not the same, and this can still cause the original build failure when IPV6_TUNNEL is not built-in but IPV6_FOU is. The fix is equally trivial, we just need to select the right symbol. I have successfully build 350 randconfig kernels with this patch and verified that the driver is now being built. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reported-by: Valentin Rothberg <valentinrothberg@gmail.com> Fixes: fabb13db448e ("fou: add Kconfig options for IPv6 support") Signed-off-by: David S. Miller <davem@davemloft.net>
| * openvswitch: update checksum in {push,pop}_mplsSimon Horman2016-05-311-5/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | In the case of CHECKSUM_COMPLETE the skb checksum should be updated in {push,pop}_mpls() as they the type in the ethernet header. As suggested by Pravin Shelar. Cc: Pravin Shelar <pshelar@nicira.com> Fixes: 25cd9ba0abc0 ("openvswitch: Add basic MPLS support to kernel") Signed-off-by: Simon Horman <simon.horman@netronome.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * sctp: sctp_diag should dump sctp socket typeXin Long2016-05-311-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now we cannot distinguish that one sk is a udp or sctp style when we use ss to dump sctp_info. it's necessary to dump it as well. For sctp_diag, ss support is not officially available, thus there are no official users of this yet, so we can add this field in the middle of sctp_info without breaking user API. v1->v2: - move 'sctpi_s_type' field to the end of struct sctp_info, so that it won't cause incompatibility with applications already built. - add __reserved3 in sctp_info to make sure sctp_info is 8-byte alignment. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * vlan: Propagate MAC address to VLANsMike Manning2016-05-313-3/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The MAC address of the physical interface is only copied to the VLAN when it is first created, resulting in an inconsistency after MAC address changes of only newly created VLANs having an up-to-date MAC. The VLANs should continue inheriting the MAC address of the physical interface until the VLAN MAC address is explicitly set to any value. This allows IPv6 EUI64 addresses for the VLAN to reflect any changes to the MAC of the physical interface and thus for DAD to behave as expected. Signed-off-by: Mike Manning <mmanning@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: pktgen: Call destroy_hrtimer_on_stack()Guenter Roeck2016-05-311-4/+4
| | | | | | | | | | | | | | | | | | If CONFIG_DEBUG_OBJECTS_TIMERS=y, hrtimer_init_on_stack() requires a matching call to destroy_hrtimer_on_stack() to clean up timer debug objects. Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: l2tp: Make l2tp_ip6 namespace awareShmulik Ladkani2016-05-301-5/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | l2tp_ip6 tunnel and session lookups were still using init_net, although the l2tp core infrastructure already supports lookups keyed by 'net'. As a result, l2tp_ip6_recv discarded packets for tunnels/sessions created in namespaces other than the init_net. Fix, by using dev_net(skb->dev) or sock_net(sk) where appropriate. Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ieee802154: fix logic error in ieee802154_llsec_parse_dev_addrBaozeng Ding2016-05-301-2/+2
| | | | | | | | | | | | | | | | Fix a logic error to avoid potential null pointer dereference. Signed-off-by: Baozeng Ding <sploving1@gmail.com> Reviewed-by: Stefan Schmidt<stefan@osg.samsung.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/lapb: tuse %*ph to dump buffersAndy Shevchenko2016-05-303-15/+8
| | | | | | | | | | | | | | | | Use %*ph specifier to dump small buffers in hex format instead doing this byte-by-byte. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * fou: add Kconfig options for IPv6 supportArnd Bergmann2016-05-303-2/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A previous patch added the fou6.ko module, but that failed to link in a couple of configurations: net/built-in.o: In function `ip6_tnl_encap_add_fou_ops': net/ipv6/fou6.c:88: undefined reference to `ip6_tnl_encap_add_ops' net/ipv6/fou6.c:94: undefined reference to `ip6_tnl_encap_add_ops' net/ipv6/fou6.c:97: undefined reference to `ip6_tnl_encap_del_ops' net/built-in.o: In function `ip6_tnl_encap_del_fou_ops': net/ipv6/fou6.c:106: undefined reference to `ip6_tnl_encap_del_ops' net/ipv6/fou6.c:107: undefined reference to `ip6_tnl_encap_del_ops' If CONFIG_IPV6=m, ip6_tnl_encap_add_ops/ip6_tnl_encap_del_ops are in a module, but fou6.c can still be built-in, and that obviously fails to link. Also, if CONFIG_IPV6=y, but CONFIG_IPV6_TUNNEL=m or CONFIG_IPV6_TUNNEL=n, the same problem happens for a different reason. This adds two new silent Kconfig symbols to work around both problems: - CONFIG_IPV6_FOU is now always set to 'm' if either CONFIG_NET_FOU=m or CONFIG_IPV6=m - CONFIG_IPV6_FOU_TUNNEL is set implicitly when IPV6_FOU is enabled and NET_FOU_IP_TUNNELS is also turned out, and it will ensure that CONFIG_IPV6_TUNNEL is also available. The options could be made user-visible as well, to give additional room for configuration, but it seems easier not to bother users with more choice here. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: aa3463d65e7b ("fou: Add encap ops for IPv6 tunnels") Signed-off-by: David S. Miller <davem@davemloft.net>
| * sctp: fix double EPs display in sctp_diagXin Long2016-05-261-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We have this situation: that EP hash table, contains only the EPs that are listening, while the transports one, has the opposite. We have to traverse both to dump all. But when we traverse the transports one we will also get EPs that are in the EP hash if they are listening. In this case, the EP is dumped twice. We will fix it by checking if the endpoint that is in the endpoint hash table contains any ep->asoc in there, as it means we will also find it via transport hash, and thus we can/should skip it, depending on the filters used, like 'ss -l'. Still, we should NOT skip it if the user is listing only listening endpoints, because then we are not traversing the transport hash. so we have to check idiag_states there also. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: hwbm: Fix unbalanced spinlock in error caseGregory CLEMENT2016-05-251-0/+3
| | | | | | | | | | | | | | | | | | | | | | When hwbm_pool_add exited in error the spinlock was not released. This patch fixes this issue. Fixes: 8cb2d8bf57e6 ("net: add a hardware buffer management helper API") Reported-by: Jean-Jacques Hiblot <jjhiblot@traphandler.com> Cc: <stable@vger.kernel.org> Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tipc: fix potential null pointer dereferences in some compat functionsBaozeng Ding2016-05-251-18/+93
| | | | | | | | | | | | | | | | | | Before calling the nla_parse_nested function, make sure the pointer to the attribute is not null. This patch fixes several potential null pointer dereference vulnerabilities in the tipc netlink functions. Signed-off-by: Baozeng Ding <sploving1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net sched actions: policer missing timestamp processingJamal Hadi Salim2016-05-251-0/+11
| | | | | | | | | | | | | | | | Policer was not dumping or updating timestamps Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net_sched: avoid too many hrtimer_start() callsEric Dumazet2016-05-242-10/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I found a serious performance bug in packet schedulers using hrtimers. sch_htb and sch_fq are definitely impacted by this problem. We constantly rearm high resolution timers if some packets are throttled in one (or more) class, and other packets are flying through qdisc on another (non throttled) class. hrtimer_start() does not have the mod_timer() trick of doing nothing if expires value does not change : if (timer_pending(timer) && timer->expires == expires) return 1; This issue is particularly visible when multiple cpus can queue/dequeue packets on the same qdisc, as hrtimer code has to lock a remote base. I used following fix : 1) Change htb to use qdisc_watchdog_schedule_ns() instead of open-coding it. 2) Cache watchdog prior expiration. hrtimer might provide this, but I prefer to not rely on some hrtimer internal. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ip6_gre: Set flowi6_proto as IPPROTO_GRE in xmit path.Haishuang Yan2016-05-241-0/+1
| | | | | | | | | | | | | | | | In gre6 xmit path, we are sending a GRE packet, so set fl6 proto to IPPROTO_GRE properly. Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ip6_gre: Fix MTU setting for ip6gretapHaishuang Yan2016-05-241-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | When creat an ip6gretap interface with an unreachable route, the MTU is about 14 bytes larger than what was needed. If the remote address is reachable: ping6 2001:0:130::1 -c 2 PING 2001:0:130::1(2001:0:130::1) 56 data bytes 64 bytes from 2001:0:130::1: icmp_seq=1 ttl=64 time=1.46 ms 64 bytes from 2001:0:130::1: icmp_seq=2 ttl=64 time=81.1 ms Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv4: Fix non-initialized TTL when CONFIG_SYSCTL=nEzequiel Garcia2016-05-232-4/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit fa50d974d104 ("ipv4: Namespaceify ip_default_ttl sysctl knob") moves the default TTL assignment, and as side-effect IPv4 TTL now has a default value only if sysctl support is enabled (CONFIG_SYSCTL=y). The sysctl_ip_default_ttl is fundamental for IP to work properly, as it provides the TTL to be used as default. The defautl TTL may be used in ip_selected_ttl, through the following flow: ip_select_ttl ip4_dst_hoplimit net->ipv4.sysctl_ip_default_ttl This commit fixes the issue by assigning net->ipv4.sysctl_ip_default_ttl in net_init_net, called during ipv4's initialization. Without this commit, a kernel built without sysctl support will send all IP packets with zero TTL (unless a TTL is explicitly set, e.g. with setsockopt). Given a similar issue might appear on the other knobs that were namespaceify, this commit also moves them. Fixes: fa50d974d104 ("ipv4: Namespaceify ip_default_ttl sysctl knob") Signed-off-by: Ezequiel Garcia <ezequiel@vanguardiasur.com.ar> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/atm: sk_err_soft must be positiveStefan Hajnoczi2016-05-232-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The sk_err and sk_err_soft fields are positive errno values and userspace applications rely on this when using getsockopt(SO_ERROR). ATM code places an -errno into sk_err_soft in sigd_send() and returns it from svc_addparty()/svc_dropparty(). Although I am not familiar with ATM code I came to this conclusion because: 1. sigd_send() msg->type cases as_okay and as_error both have: sk->sk_err = -msg->reply; while the as_addparty and as_dropparty cases have: sk->sk_err_soft = msg->reply; This is the source of the inconsistency. 2. svc_addparty() returns an -errno and assumes sk_err_soft is also an -errno: if (flags & O_NONBLOCK) { error = -EINPROGRESS; goto out; } ... error = xchg(&sk->sk_err_soft, 0); out: release_sock(sk); return error; This shows that sk_err_soft is indeed being treated as an -errno. This patch ensures that sk_err_soft is always a positive errno. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | remove lots of IS_ERR_VALUE abusesArnd Bergmann2016-05-281-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Most users of IS_ERR_VALUE() in the kernel are wrong, as they pass an 'int' into a function that takes an 'unsigned long' argument. This happens to work because the type is sign-extended on 64-bit architectures before it gets converted into an unsigned type. However, anything that passes an 'unsigned short' or 'unsigned int' argument into IS_ERR_VALUE() is guaranteed to be broken, as are 8-bit integers and types that are wider than 'unsigned long'. Andrzej Hajda has already fixed a lot of the worst abusers that were causing actual bugs, but it would be nice to prevent any users that are not passing 'unsigned long' arguments. This patch changes all users of IS_ERR_VALUE() that I could find on 32-bit ARM randconfig builds and x86 allmodconfig. For the moment, this doesn't change the definition of IS_ERR_VALUE() because there are probably still architecture specific users elsewhere. Almost all the warnings I got are for files that are better off using 'if (err)' or 'if (err < 0)'. The only legitimate user I could find that we get a warning for is the (32-bit only) freescale fman driver, so I did not remove the IS_ERR_VALUE() there but changed the type to 'unsigned long'. For 9pfs, I just worked around one user whose calling conventions are so obscure that I did not dare change the behavior. I was using this definition for testing: #define IS_ERR_VALUE(x) ((unsigned long*)NULL == (typeof (x)*)NULL && \ unlikely((unsigned long long)(x) >= (unsigned long long)(typeof(x))-MAX_ERRNO)) which ends up making all 16-bit or wider types work correctly with the most plausible interpretation of what IS_ERR_VALUE() was supposed to return according to its users, but also causes a compile-time warning for any users that do not pass an 'unsigned long' argument. I suggested this approach earlier this year, but back then we ended up deciding to just fix the users that are obviously broken. After the initial warning that caused me to get involved in the discussion (fs/gfs2/dir.c) showed up again in the mainline kernel, Linus asked me to send the whole thing again. [ Updated the 9p parts as per Al Viro - Linus ] Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Andrzej Hajda <a.hajda@samsung.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://lkml.org/lkml/2016/1/7/363 Link: https://lkml.org/lkml/2016/5/27/486 Acked-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org> # For nvmem part Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Merge branch 'for-linus' of ↵Linus Torvalds2016-05-266-1742/+3499
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph updates from Sage Weil: "This changeset has a few main parts: - Ilya has finished a huge refactoring effort to sync up the client-side logic in libceph with the user-space client code, which has evolved significantly over the last couple years, with lots of additional behaviors (e.g., how requests are handled when cluster is full and transitions from full to non-full). This structure of the code is more closely aligned with userspace now such that it will be much easier to maintain going forward when behavior changes take place. There are some locking improvements bundled in as well. - Zheng adds multi-filesystem support (multiple namespaces within the same Ceph cluster) - Zheng has changed the readdir offsets and directory enumeration so that dentry offsets are hash-based and therefore stable across directory fragmentation events on the MDS. - Zheng has a smorgasbord of bug fixes across fs/ceph" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (71 commits) ceph: fix wake_up_session_cb() ceph: don't use truncate_pagecache() to invalidate read cache ceph: SetPageError() for writeback pages if writepages fails ceph: handle interrupted ceph_writepage() ceph: make ceph_update_writeable_page() uninterruptible libceph: make ceph_osdc_wait_request() uninterruptible ceph: handle -EAGAIN returned by ceph_update_writeable_page() ceph: make fault/page_mkwrite return VM_FAULT_OOM for -ENOMEM ceph: block non-fatal signals for fault/page_mkwrite ceph: make logical calculation functions return bool ceph: tolerate bad i_size for symlink inode ceph: improve fragtree change detection ceph: keep leaf frag when updating fragtree ceph: fix dir_auth check in ceph_fill_dirfrag() ceph: don't assume frag tree splits in mds reply are sorted ceph: fix inode reference leak ceph: using hash value to compose dentry offset ceph: don't forbid marking directory complete after forward seek ceph: record 'offset' for each entry of readdir result ceph: define 'end/complete' in readdir reply as bit flags ...
| * | libceph: make ceph_osdc_wait_request() uninterruptibleYan, Zheng2016-05-261-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | Ceph_osdc_wait_request() is used when cephfs issues sync IO. In most cases, the sync IO should be uninterruptible. The fix is use killale wait function in ceph_osdc_wait_request(). Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * | ceph: make logical calculation functions return boolZhang Zhuoyu2016-05-261-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch makes serverl logical caculation functions return bool to improve readability due to these particular functions only using 0/1 as their return value. No functional change. Signed-off-by: Zhang Zhuoyu <zhangzhuoyu@cmss.chinamobile.com>
| * | libceph: support for subscribing to "mdsmap.<id>" mapsIlya Dryomov2016-05-262-5/+14
| | | | | | | | | | | | Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | libceph: replace ceph_monc_request_next_osdmap()Ilya Dryomov2016-05-262-14/+7
| | | | | | | | | | | | | | | | | | | | | ... with a wrapper around maybe_request_map() - no need for two osdmap-specific functions. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | libceph: take osdc->lock in osdmap_show() and dump flags in hexIlya Dryomov2016-05-261-5/+5
| | | | | | | | | | | | | | | | | | | | | There is now about a dozen CEPH_OSDMAP_* flags. This is a debugging interface, so just dump in hex instead of spelling each flag out. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | libceph: pool deletion detectionIlya Dryomov2016-05-261-6/+242
| | | | | | | | | | | | | | | | | | | | | | | | This adds the "map check" infrastructure for sending osdmap version checks on CALC_TARGET_POOL_DNE and completing in-flight requests with -ENOENT if the target pool doesn't exist or has just been deleted. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | libceph: async MON client generic requestsIlya Dryomov2016-05-261-106/+210
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For map check, we are going to need to send CEPH_MSG_MON_GET_VERSION messages asynchronously and get a callback on completion. Refactor MON client to allow firing off generic requests asynchronously and add an async variant of ceph_monc_get_version(). ceph_monc_do_statfs() is switched over and remains sync. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | libceph: support for checking on status of watchIlya Dryomov2016-05-261-1/+51
| | | | | | | | | | | | | | | | | | | | | | | | Implement ceph_osdc_watch_check() to be able to check on status of watch. Note that the time it takes for a watch/notify event to get delivered through the notify_wq is taken into account. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | libceph: support for sending notifiesIlya Dryomov2016-05-262-11/+226
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement ceph_osdc_notify() for sending notifies. Due to the fact that the current messenger can't do read-in into pagelists (it can only do write-out from them), I had to go with a page vector for a NOTIFY_COMPLETE payload, for now. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | libceph, rbd: ceph_osd_linger_request, watch/notify v2Ilya Dryomov2016-05-263-249/+951
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds support and switches rbd to a new, more reliable version of watch/notify protocol. As with the OSD client update, this is mostly about getting the right structures linked into the right places so that reconnects are properly sent when needed. watch/notify v2 also requires sending regular pings to the OSDs - send_linger_ping(). A major change from the old watch/notify implementation is the introduction of ceph_osd_linger_request - linger requests no longer piggy back on ceph_osd_request. ceph_osd_event has been merged into ceph_osd_linger_request. All the details are now hidden within libceph, the interface consists of a simple pair of watch/unwatch functions and ceph_osdc_notify_ack(). ceph_osdc_watch() does return ceph_osd_linger_request, but only to keep the lifetime management simple. ceph_osdc_notify_ack() accepts an optional data payload, which is relayed back to the notifier. Portions of this patch are loosely based on work by Douglas Fuller <dfuller@redhat.com> and Mike Christie <michaelc@cs.wisc.edu>. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | libceph: wait_request_timeout()Ilya Dryomov2016-05-261-13/+21
| | | | | | | | | | | | | | | | | | | | | | | | The unwatch timeout is currently implemented in rbd. With watch/unwatch code moving into libceph, we are going to need a ceph_osdc_wait_request() variant with a timeout. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | libceph: request_init() and request_release_checks()Ilya Dryomov2016-05-261-17/+27
| | | | | | | | | | | | | | | | | | These are going to be used by request_reinit() code. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | libceph: a major OSD client updateIlya Dryomov2016-05-262-611/+587
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is a major sync up, up to ~Jewel. The highlights are: - per-session request trees (vs a global per-client tree) - per-session locking (vs a global per-client rwlock) - homeless OSD session - no ad-hoc global per-client lists - support for pool quotas - foundation for watch/notify v2 support - foundation for map check (pool deletion detection) support The switchover is incomplete: lingering requests can be setup and teared down but aren't ever reestablished. This functionality is restored with the introduction of the new lingering infrastructure (ceph_osd_linger_request, linger_work, etc) in a later commit. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | libceph: protect osdc->osd_lru list with a spinlockIlya Dryomov2016-05-261-11/+18
| | | | | | | | | | | | | | | | | | | | | | | | OSD client is getting moved from the big per-client lock to a set of per-session locks. The big rwlock would only be held for read most of the time, so a global osdc->osd_lru needs additional protection. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | libceph: allocate ceph_osd with GFP_NOFAILIlya Dryomov2016-05-261-4/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | create_osd() is called way too deep in the stack to be able to error out in a sane way; a failing create_osd() just messes everything up. The current req_notarget list solution is broken - the list is never traversed as it's not entirely clear when to do it, I guess. If we were to start traversing it at regular intervals and retrying each request, we wouldn't be far off from what __GFP_NOFAIL is doing, so allocate OSD sessions with __GFP_NOFAIL, at least until we come up with a better fix. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | libceph: osd_init() and osd_cleanup()Ilya Dryomov2016-05-261-9/+37
| | | | | | | | | | | | | | | | | | These are going to be used by homeless OSD sessions code. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | libceph: handle_one_map()Ilya Dryomov2016-05-262-56/+138
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Separate osdmap handling from decoding and iterating over a bag of maps in a fresh MOSDMap message. This sets up the scene for the updated OSD client. Of particular importance here is the addition of pi->was_full, which can be used to answer "did this pool go full -> not-full in this map?". This is the key bit for supporting pool quotas. We won't be able to downgrade map_sem for much longer, so drop downgrade_write(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | libceph: allocate dummy osdmap in ceph_osdc_init()Ilya Dryomov2016-05-262-16/+29
| | | | | | | | | | | | | | | | | | | | | This leads to a simpler osdmap handling code, particularly when dealing with pi->was_full, which is introduced in a later commit. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | libceph: schedule tick from ceph_osdc_init()Ilya Dryomov2016-05-261-28/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | Both homeless OSD sessions and watch/notify v2, introduced in later commits, require periodic ticks which don't depend on ->num_requests. Schedule the initial tick from ceph_osdc_init() and reschedule from handle_timeout() unconditionally. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | libceph: move schedule_delayed_work() in ceph_osdc_init()Ilya Dryomov2016-05-261-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | ceph_osdc_stop() isn't called if ceph_osdc_init() fails, so we end up with handle_osds_timeout() running on invalid memory if any one of the allocations fails. Call schedule_delayed_work() after everything is setup, just before returning. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | libceph: redo callbacks and factor out MOSDOpReply decodingIlya Dryomov2016-05-261-153/+209
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If you specify ACK | ONDISK and set ->r_unsafe_callback, both ->r_callback and ->r_unsafe_callback(true) are called on ack. This is very confusing. Redo this so that only one of them is called: ->r_unsafe_callback(true), on ack ->r_unsafe_callback(false), on commit or ->r_callback, on ack|commit Decode everything in decode_MOSDOpReply() to reduce clutter. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | libceph: drop msg argument from ceph_osdc_callback_tIlya Dryomov2016-05-261-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | finish_read(), its only user, uses it to get to hdr.data_len, which is what ->r_result is set to on success. This gains us the ability to safely call callbacks from contexts other than reply, e.g. map check. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | libceph: switch to calc_target(), part 2Ilya Dryomov2016-05-262-200/+216
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The crux of this is getting rid of ceph_osdc_build_request(), so that MOSDOp can be encoded not before but after calc_target() calculates the actual target. Encoding now happens within ceph_osdc_start_request(). Also nuked is the accompanying bunch of pointers into the encoded buffer that was used to update fields on each send - instead, the entire front is re-encoded. If we want to support target->name_len != base->name_len in the future, there is no other way, because oid is surrounded by other fields in the encoded buffer. Encoding OSD ops and adding data items to the request message were mixed together in osd_req_encode_op(). While we want to re-encode OSD ops, we don't want to add duplicate data items to the message when resending, so all call to ceph_osdc_msg_data_add() are factored out into a new setup_request_data(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | libceph: switch to calc_target(), part 1Ilya Dryomov2016-05-262-97/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Replace __calc_request_pg() and most of __map_request() with calc_target() and start using req->r_t. ceph_osdc_build_request() however still encodes base_oid, because it's called before calc_target() is and target_oid is empty at that point in time; a printf in osdc_show() also shows base_oid. This is fixed in "libceph: switch to calc_target(), part 2". Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | libceph: introduce ceph_osd_request_target, calc_target()Ilya Dryomov2016-05-262-2/+276
| | | | | | | | | | | | | | | | | | | | | | | | Introduce ceph_osd_request_target, containing all mapping-related fields of ceph_osd_request and calc_target() for calculating mappings and populating it. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | libceph: pi->min_size, pi->last_force_request_resendIlya Dryomov2016-05-262-5/+53
| | | | | | | | | | | | | | | | | | | | | Add and decode pi->min_size and pi->last_force_request_resend. These are going to be used by calc_target(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | libceph: make pgid_cmp() globalIlya Dryomov2016-05-261-11/+12
| | | | | | | | | | | | | | | | | | | | | calc_target() code is going to need to know how to compare PGs. Take lhs and rhs pgid by const * while at it. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | libceph: rename ceph_calc_pg_primary()Ilya Dryomov2016-05-261-4/+5
| | | | | | | | | | | | | | | | | | | | | Rename ceph_calc_pg_primary() to ceph_pg_to_acting_primary() to emphasise that it returns acting primary. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | libceph: ceph_osds, ceph_pg_to_up_acting_osds()Ilya Dryomov2016-05-262-143/+197
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Knowning just acting set isn't enough, we need to be able to record up set as well to detect interval changes. This means returning (up[], up_len, up_primary, acting[], acting_len, acting_primary) and passing it around. Introduce and switch to ceph_osds to help with that. Rename ceph_calc_pg_acting() to ceph_pg_to_up_acting_osds() and return both up and acting sets from it. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>