path: root/net/sunrpc
Commit message (Author, Age, Files, Lines)
* Merge tag 'nfsd-4.7' of git://linux-nfs.org/~bfields/linux (Linus Torvalds, 2016-05-24, 6 files, -62/+81)
|\
| |  Pull nfsd updates from Bruce Fields:
| |   "A very quiet cycle for nfsd, mainly just an RDMA update from Chuck Lever"
| |
| |  * tag 'nfsd-4.7' of git://linux-nfs.org/~bfields/linux:
| |    sunrpc: fix stripping of padded MIC tokens
| |    svcrpc: autoload rdma module
| |    svcrdma: Generalize svc_rdma_xdr_decode_req()
| |    svcrdma: Eliminate code duplication in svc_rdma_recvfrom()
| |    svcrdma: Drain QP before freeing svcrdma_xprt
| |    svcrdma: Post Receives only for forward channel requests
| |    svcrdma: Remove superfluous line from rdma_read_chunks()
| |    svcrdma: svc_rdma_put_context() is invoked twice in Send error path
| |    svcrdma: Do not add XDR padding to xdr_buf page vector
| |    svcrdma: Support IPv6 with NFS/RDMA
| |    nfsd: handle seqid wraparound in nfsd4_preprocess_layout_stateid
| |    Remove unnecessary allocation
| * sunrpc: fix stripping of padded MIC tokens (Tomáš Trnka, 2016-05-23, 1 file, -2/+2)
| |
| |  The length of the GSS MIC token need not be a multiple of four bytes.
| |  It is then padded by XDR to a multiple of 4 B, but unwrap_integ_data()
| |  would previously only trim mic.len + 4 B. The remaining up to three
| |  bytes would then trigger a check in nfs4svc_decode_compoundargs(),
| |  leading to a "garbage args" error and mount failure:
| |
| |    nfs4svc_decode_compoundargs: compound not properly padded!
| |    nfsd: failed to decode arguments!
| |
| |  This would prevent older clients using the pre-RFC 4121 MIC format
| |  (37-byte MIC including a 9-byte OID) from mounting exports from v3.9+
| |  servers using krb5i.
| |
| |  The trimming was introduced by commit 4c190e2f913f ("sunrpc: trim off
| |  trailing checksum before returning decrypted or integrity authenticated
| |  buffer").
| |
| |  Fixes: 4c190e2f913f "sunrpc: trim off trailing checksum..."
| |  Signed-off-by: Tomáš Trnka <ttrnka@mail.muni.cz>
| |  Cc: stable@vger.kernel.org
| |  Acked-by: Jeff Layton <jlayton@poochiereds.net>
| |  Signed-off-by: J. Bruce Fields <bfields@redhat.com>
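The arithmetic behind the MIC-token fix can be sketched in a few lines of C. This is only an illustration under stated assumptions: the helper names (xdr_padded_len, mic_trim_unpadded, mic_trim_padded, mic_leftover) are hypothetical and do not appear in the kernel; the real change is in unwrap_integ_data(). It shows why a 37-byte MIC leaves three stray bytes when only mic.len + 4 bytes are trimmed.

```c
#include <assert.h>
#include <stddef.h>

/* XDR pads opaque data up to the next multiple of 4 bytes. */
static size_t xdr_padded_len(size_t len)
{
    return (len + 3) & ~(size_t)3;
}

/* Bytes trimmed before the fix: 4-byte length word plus the raw MIC length. */
static size_t mic_trim_unpadded(size_t mic_len)
{
    return 4 + mic_len;
}

/* Bytes that really trail the payload: length word plus the padded MIC. */
static size_t mic_trim_padded(size_t mic_len)
{
    return 4 + xdr_padded_len(mic_len);
}

/* Stray pad bytes left in the buffer when the pad is not trimmed (0..3). */
static size_t mic_leftover(size_t mic_len)
{
    size_t left = mic_trim_padded(mic_len) - mic_trim_unpadded(mic_len);

    assert(left < 4);
    return left;
}
```

For the 37-byte pre-RFC 4121 MIC mentioned above, mic_leftover(37) is 3, which matches the "remaining up to three bytes" that tripped the padding check; a MIC whose length is already a multiple of four leaves nothing behind.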
| * svcrpc: autoload rdma module (J. Bruce Fields, 2016-05-23, 1 file, -4/+19)
| |
| |  This should fix failures like:
| |
| |    # rpc.nfsd --rdma
| |    rpc.nfsd: Unable to request RDMA services: Protocol not supported
| |
| |  Reported-by: Steve Dickson <steved@redhat.com>
| |  Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
| |  Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * svcrdma: Generalize svc_rdma_xdr_decode_req() (Chuck Lever, 2016-05-13, 2 files, -11/+23)
| |
| |  Clean up: Pass in just the piece of the svc_rqst that is needed
| |  here. While we're in the area, add an informative documenting comment.
| |
| |  Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| |  Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * svcrdma: Eliminate code duplication in svc_rdma_recvfrom() (Chuck Lever, 2016-05-13, 1 file, -21/+5)
| |
| |  Clean up.
| |
| |  Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| |  Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * svcrdma: Drain QP before freeing svcrdma_xprt (Chuck Lever, 2016-05-13, 1 file, -0/+3)
| |
| |  If the server has forced a disconnect, the associated QP has not
| |  been moved to the Error state, and thus Receives are still posted.
| |  Ensure Receives (and any other outstanding WRs) are drained to
| |  release resources that can be freed during teardown of the
| |  svcrdma_xprt.
| |
| |  Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| |  Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * svcrdma: Post Receives only for forward channel requests (Chuck Lever, 2016-05-13, 1 file, -1/+1)
| |
| |  Since backward direction support was added, the rq_depth was
| |  increased to accommodate both forward and backward Receives.
| |
| |  But only forward Receives need to be posted after a connection
| |  has been accepted. Receives for backward replies are posted as
| |  needed by svc_rdma_bc_sendto().
| |
| |  This doesn't break anything, but it means some resources are
| |  wasted.
| |
| |  Fixes: 03fe9931536f ('svcrdma: Define maximum number of ...')
| |  Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| |  Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * svcrdma: Remove superfluous line from rdma_read_chunks() (Chuck Lever, 2016-05-13, 1 file, -3/+1)
| |
| |  Clean up: svc_rdma_get_read_chunk() already returns a pointer
| |  to the Read list. No need to set "ch" again to the value it
| |  already contains.
| |
| |  Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| |  Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * svcrdma: svc_rdma_put_context() is invoked twice in Send error path (Chuck Lever, 2016-05-13, 1 file, -15/+13)
| |
| |  Get a fresh op_ctxt in send_reply() instead of in svc_rdma_sendto().
| |  This ensures that svc_rdma_put_context() is invoked only once if
| |  send_reply() fails.
| |
| |  Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| |  Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * svcrdma: Do not add XDR padding to xdr_buf page vector (Chuck Lever, 2016-05-13, 1 file, -1/+1)
| |
| |  An xdr_buf has a head, a vector of pages, and a tail. Each
| |  RPC request is presented to the NFS server contained in an
| |  xdr_buf.
| |
| |  The RDMA transport would like to supply the NFS server with only
| |  the NFS WRITE payload bytes in the page vector. In some common
| |  cases, that would allow the NFS server to swap those pages right
| |  into the target file's page cache.
| |
| |  Have the transport's RDMA Read logic put XDR pad bytes in the tail
| |  iovec, and not in the pages that hold the data payload.
| |
| |  The NFSv3 WRITE XDR decoder is finicky about the lengths involved,
| |  so make sure it is looking in the correct places when computing
| |  the total length of the incoming NFS WRITE request.
| |
| |  Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| |  Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * svcrdma: Support IPv6 with NFS/RDMA (Shirley Ma, 2016-05-13, 1 file, -1/+11)
| |
| |  Allow both IPv4 and IPv6 to bind the same port at the same time,
| |  and restrict use of the IPv6 socket to IPv6 communication.
| |
| |  Changes from v1:
| |   - Check rdma_set_afonly return value (suggested by Leon Romanovsky)
| |
| |  Changes from v2:
| |   - Acked-by: Leon Romanovsky <leonro@mellanox.com>
| |
| |  Signed-off-by: Shirley Ma <shirley.ma@oracle.com>
| |  Acked-by: Leon Romanovsky <leonro@mellanox.com>
| |  Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
| |  Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * Remove unnecessary allocation (J. Bruce Fields, 2016-05-03, 1 file, -3/+2)
| |
| |  Reported-by: Benjamin Coddington <bcodding@redhat.com>
| |  Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma (Linus Torvalds, 2016-05-20, 2 files, -2/+2)
|\ \
| | |  Pull rdma updates from Doug Ledford:
| | |   "Primary 4.7 merge window changes
| | |
| | |     - Updates to the new Intel X722 iWARP driver
| | |     - Updates to the hfi1 driver
| | |     - Fixes for the iw_cxgb4 driver
| | |     - Misc core fixes
| | |     - Generic RDMA READ/WRITE API addition
| | |     - SRP updates
| | |     - Misc ipoib updates
| | |     - Minor mlx5 updates"
| | |
| | |  * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (148 commits)
| | |    IB/mlx5: Fire the CQ completion handler from tasklet
| | |    net/mlx5_core: Use tasklet for user-space CQ completion events
| | |    IB/core: Do not require CAP_NET_ADMIN for packet sniffing
| | |    IB/mlx4: Fix unaligned access in send_reply_to_slave
| | |    IB/mlx5: Report Scatter FCS device capability when supported
| | |    IB/mlx5: Add Scatter FCS support for Raw Packet QP
| | |    IB/core: Add Scatter FCS create flag
| | |    IB/core: Add Raw Scatter FCS device capability
| | |    IB/core: Add extended device capability flags
| | |    i40iw: pass hw_stats by reference rather than by value
| | |    i40iw: Remove unnecessary synchronize_irq() before free_irq()
| | |    i40iw: constify i40iw_vf_cqp_ops structure
| | |    IB/mlx5: Add UARs write-combining and non-cached mapping
| | |    IB/mlx5: Allow mapping the free running counter on PROT_EXEC
| | |    IB/mlx4: Use list_for_each_entry_safe
| | |    IB/SA: Use correct free function
| | |    IB/core: Fix a potential array overrun in CMA and SA agent
| | |    IB/core: Remove unnecessary check in ibnl_rcv_msg
| | |    IB/IWPM: Fix a potential skb leak
| | |    RDMA/nes: replace custom print_hex_dump()
| | |    ...
| * | IB/core: Enhance ib_map_mr_sg() (Bart Van Assche, 2016-05-13, 2 files, -2/+2)
| | |
| | |  The SRP initiator allows max_sectors to be set to a value that
| | |  exceeds the largest amount of data that can be mapped at once
| | |  with an mlx4 HCA using fast registration and a page size of
| | |  4 KB. Hence modify ib_map_mr_sg() such that it can map partial
| | |  sg-elements. If an sg-element has been mapped partially, let the
| | |  caller know which fraction has been mapped by adjusting *sg_offset.
| | |
| | |  Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
| | |  Tested-by: Laurence Oberman <loberman@redhat.com>
| | |  Cc: Christoph Hellwig <hch@lst.de>
| | |  Cc: Sagi Grimberg <sagi@grimberg.me>
| | |  Signed-off-by: Doug Ledford <dledford@redhat.com>
| * | IB/core: Add passing an offset into the SG to ib_map_mr_sg (Christoph Hellwig, 2016-05-13, 2 files, -2/+2)
| |/
| |  Signed-off-by: Christoph Hellwig <hch@lst.de>
| |  Tested-by: Steve Wise <swise@opengridcomputing.com>
| |  Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
| |  Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
| |  Reviewed-by: Steve Wise <swise@opengridcomputing.com>
| |  Signed-off-by: Doug Ledford <dledford@redhat.com>
* | sunrpc: set SOCK_FASYNC (Eric Dumazet, 2016-05-13, 1 file, -0/+3)
| |
| |  sunrpc is using SOCKWQ_ASYNC_NOSPACE without setting SOCK_FASYNC,
| |  so the recent optimizations done in sk_set_bit() and sk_clear_bit()
| |  broke it.
| |
| |  There is still the risk that a subsequent sock_fasync() call
| |  would clear SOCK_FASYNC, but sunrpc does not use this yet.
| |
| |  Fixes: 9317bb69824e ("net: SOCKWQ_ASYNC_NOSPACE optimizations")
| |  Signed-off-by: Eric Dumazet <edumazet@google.com>
| |  Reported-by: Jiri Pirko <jiri@resnulli.us>
| |  Reported-by: Huang, Ying <ying.huang@intel.com>
| |  Tested-by: Jiri Pirko <jiri@resnulli.us>
| |  Tested-by: Huang, Ying <ying.huang@intel.com>
| |  Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: udp: rename UDP_INC_STATS_BH() (Eric Dumazet, 2016-04-28, 1 file, -2/+2)
| |
| |  Rename UDP_INC_STATS_BH() to __UDP_INC_STATS(),
| |  and UDP6_INC_STATS_BH() to __UDP6_INC_STATS()
| |
| |  Signed-off-by: Eric Dumazet <edumazet@google.com>
| |  Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (David S. Miller, 2016-04-24, 2 files, -3/+8)
|\|
| |  Conflicts were two cases of simple overlapping changes,
| |  nothing serious.
| |
| |  In the UDP case, we need to add a hlist_add_tail_rcu()
| |  to linux/rculist.h, because we've moved UDP socket handling
| |  away from using nulls lists.
| |
| |  Signed-off-by: David S. Miller <davem@davemloft.net>
| * Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 (Linus Torvalds, 2016-04-15, 2 files, -3/+8)
| |\
| | |  Pull crypto fixes from Herbert Xu:
| | |   "This fixes an NFS regression caused by the skcipher/hash conversion
| | |    in sunrpc. It also fixes a build problem in certain configurations
| | |    with bcm63xx"
| | |
| | |  * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
| | |    hwrng: bcm63xx - fix device tree compilation
| | |    sunrpc: Fix skcipher/shash conversion
| | * sunrpc: Fix skcipher/shash conversion (Herbert Xu, 2016-04-04, 2 files, -3/+8)
| | |
| | |  The skcipher/shash conversion introduced a number of bugs in the
| | |  sunrpc code:
| | |
| | |  1) Missing calls to skcipher_request_set_tfm lead to crashes.
| | |
| | |  2) The allocation size of shash_desc is too small which leads to
| | |     memory corruption.
| | |
| | |  Fixes: 3b5cf20cf439 ("sunrpc: Use skcipher and ahash/shash")
| | |  Reported-by: J. Bruce Fields <bfields@fieldses.org>
| | |  Tested-by: J. Bruce Fields <bfields@fieldses.org>
| | |  Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* | | sock: tigthen lockdep checks for sock_owned_by_user (Hannes Frederic Sowa, 2016-04-14, 2 files, -4/+2)
| | |
| | |  sock_owned_by_user should not be used without socket lock held. It seems
| | |  to be a common practice to check .owned before lock reclassification, so
| | |  provide a little help to abstract this check away.
| | |
| | |  Cc: linux-cifs@vger.kernel.org
| | |  Cc: linux-bluetooth@vger.kernel.org
| | |  Cc: linux-nfs@vger.kernel.org
| | |  Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
| | |  Signed-off-by: David S. Miller <davem@davemloft.net>
* | | sunrpc: do not pull udp headers on receive (Willem de Bruijn, 2016-04-11, 3 files, -7/+5)
|/ /
| |  Commit e6afc8ace6dd modified the udp receive path by pulling the udp
| |  header before queuing an skbuff onto the receive queue.
| |
| |  Sunrpc also calls skb_recv_datagram to dequeue an skb from a udp
| |  socket. Modify this receive path to also no longer expect udp
| |  headers.
| |
| |  Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing")
| |  Reported-by: Franklin S Cooper Jr. <fcooper@ti.com>
| |  Signed-off-by: Willem de Bruijn <willemb@google.com>
| |  Tested-by: Thierry Reding <treding@nvidia.com>
| |  Signed-off-by: David S. Miller <davem@davemloft.net>
* | mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage (Kirill A. Shutemov, 2016-04-04, 1 file, -1/+1)
| |
| |  Mostly direct substitution with occasional adjustment or removing
| |  outdated comments.
| |
| |  Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
| |  Acked-by: Michal Hocko <mhocko@suse.com>
| |  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros (Kirill A. Shutemov, 2016-04-04, 7 files, -38/+38)
|/
|  PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
|  ago with promise that one day it will be possible to implement page
|  cache with bigger chunks than PAGE_SIZE.
|
|  This promise never materialized. And unlikely will.
|
|  We have many places where PAGE_CACHE_SIZE assumed to be equal to
|  PAGE_SIZE. And it's constant source of confusion on whether
|  PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
|  especially on the border between fs and mm.
|
|  Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
|  breakage to be doable.
|
|  Let's stop pretending that pages in page cache are special. They are
|  not.
|
|  The changes are pretty straight-forward:
|
|   - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
|   - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
|   - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
|   - page_cache_get() -> get_page();
|   - page_cache_release() -> put_page();
|
|  This patch contains automated changes generated with coccinelle using
|  script below. For some reason, coccinelle doesn't patch header files.
|  I've called spatch for them manually.
|
|  The only adjustment after coccinelle is revert of changes to
|  PAGE_CACHE_ALIGN definition: we are going to drop it later.
|
|  There are few places in the code where coccinelle didn't reach. I'll
|  fix them manually in a separate patch. Comments and documentation also
|  will be addressed with the separate patch.
|
|    virtual patch
|
|    @@
|    expression E;
|    @@
|    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
|    + E
|
|    @@
|    expression E;
|    @@
|    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
|    + E
|
|    @@
|    @@
|    - PAGE_CACHE_SHIFT
|    + PAGE_SHIFT
|
|    @@
|    @@
|    - PAGE_CACHE_SIZE
|    + PAGE_SIZE
|
|    @@
|    @@
|    - PAGE_CACHE_MASK
|    + PAGE_MASK
|
|    @@
|    expression E;
|    @@
|    - PAGE_CACHE_ALIGN(E)
|    + PAGE_ALIGN(E)
|
|    @@
|    expression E;
|    @@
|    - page_cache_get(E)
|    + get_page(E)
|
|    @@
|    expression E;
|    @@
|    - page_cache_release(E)
|    + put_page(E)
|
|  Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
|  Acked-by: Michal Hocko <mhocko@suse.com>
|  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge tag 'nfsd-4.6' of git://linux-nfs.org/~bfields/linux (Linus Torvalds, 2016-03-24, 8 files, -439/+359)
|\
| |  Pull nfsd updates from Bruce Fields:
| |   "Various bugfixes, a RDMA update from Chuck Lever, and support for a
| |    new pnfs layout type from Christoph Hellwig. The new layout type is a
| |    variant of the block layout which uses SCSI features to offer improved
| |    fencing and device identification.
| |
| |    (Also: note this pull request also includes the client side of SCSI
| |    layout, with Trond's permission.)"
| |
| |  * tag 'nfsd-4.6' of git://linux-nfs.org/~bfields/linux:
| |    sunrpc/cache: drop reference when sunrpc_cache_pipe_upcall() detects a race
| |    nfsd: recover: fix memory leak
| |    nfsd: fix deadlock secinfo+readdir compound
| |    nfsd4: resfh unused in nfsd4_secinfo
| |    svcrdma: Use new CQ API for RPC-over-RDMA server send CQs
| |    svcrdma: Use new CQ API for RPC-over-RDMA server receive CQs
| |    svcrdma: Remove close_out exit path
| |    svcrdma: Hook up the logic to return ERR_CHUNK
| |    svcrdma: Use correct XID in error replies
| |    svcrdma: Make RDMA_ERROR messages work
| |    rpcrdma: Add RPCRDMA_HDRLEN_ERR
| |    svcrdma: svc_rdma_post_recv() should close connection on error
| |    svcrdma: Close connection when a send error occurs
| |    nfsd: Lower NFSv4.1 callback message size limit
| |    svcrdma: Do not send Write chunk XDR pad with inline content
| |    svcrdma: Do not write xdr_buf::tail in a Write chunk
| |    svcrdma: Find client-provided write and reply chunks once per reply
| |    nfsd: Update NFS server comments related to RDMA support
| |    nfsd: Fix a memory leak when meeting unsupported state_protect_how4
| |    nfsd4: fix bad bounds checking
| * sunrpc/cache: drop reference when sunrpc_cache_pipe_upcall() detects a race (NeilBrown, 2016-03-17, 1 file, -3/+3)
| |
| |  sunrpc_cache_pipe_upcall() can detect a race if CACHE_PENDING is no longer
| |  set. In this case it aborts the queuing of the upcall.
| |  However it has already taken a new counted reference on "h" and
| |  doesn't "put" it, even though it frees the data structure holding the
| |  reference.
| |
| |  So let's delay the "cache_get" until we know we need it.
| |
| |  Fixes: f9e1aedc6c79 ("sunrpc/cache: remove races with queuing an upcall.")
| |  Signed-off-by: NeilBrown <neilb@suse.com>
| |  Signed-off-by: J. Bruce Fields <bfields@redhat.com>
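The "delay the get until we know we need it" pattern described above can be sketched with a toy reference counter. This is a simplified model, not the kernel code: struct toy_cache_head, toy_get() and toy_queue_upcall() are hypothetical names, and locking and the request structure are left out. The point is only the ordering: check for the race first, take the reference second, so the abort path has nothing to put.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a cache entry: a reference count plus a pending flag. */
struct toy_cache_head {
    int refcount;
    bool pending;   /* models CACHE_PENDING being set */
};

static void toy_get(struct toy_cache_head *h)
{
    h->refcount++;
}

/*
 * Queue an upcall for h. Returns 0 when queued, -1 when the race is
 * detected (pending already cleared). The reference is taken only on
 * the success path, so the abort path cannot leak one.
 */
static int toy_queue_upcall(struct toy_cache_head *h)
{
    if (!h->pending)
        return -1;  /* aborted: no reference was taken */

    toy_get(h);     /* the queued request now holds a reference */
    return 0;
}
```

With the get placed before the check instead, an aborted upcall would leave refcount one higher than it should be, which is the leak the commit message describes.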
| * svcrdma: Use new CQ API for RPC-over-RDMA server send CQs (Chuck Lever, 2016-03-01, 4 files, -175/+114)
| |
| |  Calling ib_poll_cq() to sort through WCs during a completion is a
| |  common pattern amongst RDMA consumers. Since commit 14d3a3b2498e
| |  ("IB: add a proper completion queue abstraction"), WC sorting can
| |  be handled by the IB core.
| |
| |  By converting to this new API, svcrdma is made a better neighbor to
| |  other RDMA consumers, as it allows the core to schedule the delivery
| |  of completions more fairly amongst all active consumers.
| |
| |  This new API also aims each completion at a function that is
| |  specific to the WR's opcode. Thus the ctxt->wr_op field and the
| |  switch in process_context is replaced by a set of methods that
| |  handle each completion type.
| |
| |  Because each ib_cqe carries a pointer to a completion method, the
| |  core can now post operations on a consumer's QP, and handle the
| |  completions itself.
| |
| |  The server's rdma_stat_sq_poll and rdma_stat_sq_prod metrics are no
| |  longer updated.
| |
| |  As a clean up, the cq_event_handler, the dto_tasklet, and all
| |  associated locking is removed, as they are no longer referenced or
| |  used.
| |
| |  Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| |  Tested-by: Steve Wise <swise@opengridcomputing.com>
| |  Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * svcrdma: Use new CQ API for RPC-over-RDMA server receive CQs (Chuck Lever, 2016-03-01, 1 file, -90/+39)
| |
| |  Calling ib_poll_cq() to sort through WCs during a completion is a
| |  common pattern amongst RDMA consumers. Since commit 14d3a3b2498e
| |  ("IB: add a proper completion queue abstraction"), WC sorting can
| |  be handled by the IB core.
| |
| |  By converting to this new API, svcrdma is made a better neighbor to
| |  other RDMA consumers, as it allows the core to schedule the delivery
| |  of completions more fairly amongst all active consumers.
| |
| |  Because each ib_cqe carries a pointer to a completion method, the
| |  core can now post operations on a consumer's QP, and handle the
| |  completions itself.
| |
| |  svcrdma receive completions no longer use the dto_tasklet. Each
| |  polled Receive WC is now handled individually in soft IRQ context.
| |
| |  The server transport's rdma_stat_rq_poll and rdma_stat_rq_prod
| |  metrics are no longer updated.
| |
| |  Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| |  Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * svcrdma: Remove close_out exit path (Chuck Lever, 2016-03-01, 1 file, -11/+1)
| |
| |  Clean up: close_out is reached only when ctxt == NULL and XPT_CLOSE
| |  is already set.
| |
| |  Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| |  Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com>
| |  Tested-by: Devesh Sharma <devesh.sharma@broadcom.com>
| |  Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * svcrdma: Hook up the logic to return ERR_CHUNK (Chuck Lever, 2016-03-01, 2 files, -13/+46)
| |
| |  RFC 5666 Section 4.2 states:
| |
| |  > When the peer detects an RPC-over-RDMA header version that it does
| |  > not support (currently this document defines only version 1), it
| |  > replies with an error code of ERR_VERS, and provides the low and
| |  > high inclusive version numbers it does, in fact, support.
| |
| |  And:
| |
| |  > When other decoding errors are detected in the header or chunks,
| |  > either an RPC decode error MAY be returned or the RPC/RDMA error
| |  > code ERR_CHUNK MUST be returned.
| |
| |  The Linux NFS server does throw ERR_VERS when a client sends it
| |  a request whose rdma_version is not "one." But it does not return
| |  ERR_CHUNK when a header decoding error occurs. It just drops the
| |  request.
| |
| |  To improve protocol extensibility, it should reject invalid values
| |  in the rdma_proc field instead of treating them all like RDMA_MSG.
| |  Otherwise clients can't detect when the server doesn't support
| |  new rdma_proc values.
| |
| |  Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| |  Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com>
| |  Tested-by: Devesh Sharma <devesh.sharma@broadcom.com>
| |  Signed-off-by: J. Bruce Fields <bfields@redhat.com>
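The header checks quoted above can be modelled as a small classifier. This is a hedged sketch, not the server's decoder: the enum values follow RFC 5666, but the function check_rpcrdma_header() and the hdr_verdict type are hypothetical, and treating every defined procedure number as acceptable is an assumption made for illustration only (the real decoder also validates chunk lists and handles some procedures differently).

```c
#include <assert.h>
#include <stdint.h>

/* Procedure and error codes as defined in RFC 5666. */
enum rpcrdma_proc {
    RDMA_MSG = 0, RDMA_NOMSG = 1, RDMA_MSGP = 2, RDMA_DONE = 3, RDMA_ERROR = 4
};
enum rpcrdma_errcode { ERR_VERS = 1, ERR_CHUNK = 2 };

/* Hypothetical verdict type for this sketch. */
enum hdr_verdict { HDR_OK = 0, HDR_REPLY_ERR_VERS, HDR_REPLY_ERR_CHUNK };

/*
 * Classify an incoming header: an unsupported version earns ERR_VERS,
 * an unknown rdma_proc earns ERR_CHUNK instead of being treated as
 * RDMA_MSG or silently dropped.
 */
static enum hdr_verdict check_rpcrdma_header(uint32_t vers, uint32_t proc)
{
    if (vers != 1)
        return HDR_REPLY_ERR_VERS;

    switch (proc) {
    case RDMA_MSG:
    case RDMA_NOMSG:
    case RDMA_MSGP:
    case RDMA_DONE:
    case RDMA_ERROR:
        return HDR_OK;
    default:
        return HDR_REPLY_ERR_CHUNK;
    }
}
```

Rejecting unknown rdma_proc values explicitly is what lets a client probe for newer procedures: it gets a definite ERR_CHUNK rather than a misinterpreted request or a timeout.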
| * svcrdma: Use correct XID in error replies (Chuck Lever, 2016-03-01, 2 files, -7/+3)
| |
| |  When constructing an error reply, svc_rdma_xdr_encode_error()
| |  needs to view the client's request message so it can get the
| |  failing request's XID.
| |
| |  svc_rdma_xdr_decode_req() is supposed to return a pointer to the
| |  client's request header. But if it fails to decode the client's
| |  message (and thus an error reply is needed) it does not return the
| |  pointer. The server then sends a bogus XID in the error reply.
| |
| |  Instead, unconditionally generate the pointer to the client's header
| |  in svc_rdma_recvfrom(), and pass that pointer to both functions.
| |
| |  Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| |  Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com>
| |  Tested-by: Devesh Sharma <devesh.sharma@broadcom.com>
| |  Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * svcrdma: Make RDMA_ERROR messages work (Chuck Lever, 2016-03-01, 4 files, -65/+72)
| |
| |  Fix several issues with svc_rdma_send_error():
| |
| |   - Post a receive buffer to replace the one that was consumed by
| |     the incoming request
| |   - Posting a send should use DMA_TO_DEVICE, not DMA_FROM_DEVICE
| |   - No need to put_page _and_ free pages in svc_rdma_put_context
| |   - Make sure the sge is set up completely in case the error
| |     path goes through svc_rdma_unmap_dma()
| |   - Replace the use of ENOSYS, which has a reserved meaning
| |
| |  Related fixes in svc_rdma_recvfrom():
| |
| |   - Don't leak the ctxt associated with the incoming request
| |   - Don't close the connection after sending an error reply
| |   - Let svc_rdma_send_error() figure out the right header error code
| |
| |  As a last clean up, move svc_rdma_send_error() to svc_rdma_sendto.c
| |  with other similar functions. There is some common logic in these
| |  functions that could someday be combined to reduce code duplication.
| |
| |  Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| |  Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com>
| |  Tested-by: Devesh Sharma <devesh.sharma@broadcom.com>
| |  Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * svcrdma: svc_rdma_post_recv() should close connection on error (Chuck Lever, 2016-03-01, 4 files, -24/+19)
| |
| |  Clean up: Most svc_rdma_post_recv() call sites close the transport
| |  connection when a receive cannot be posted. Wrap that in a common
| |  helper.
| |
| |  Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| |  Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com>
| |  Tested-by: Devesh Sharma <devesh.sharma@broadcom.com>
| |  Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * svcrdma: Close connection when a send error occurs (Chuck Lever, 2016-03-01, 1 file, -2/+6)
| |
| |  Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| |  Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * nfsd: Lower NFSv4.1 callback message size limit (Chuck Lever, 2016-03-01, 2 files, -6/+4)
| |
| |  The maximum size of a backchannel message on RPC-over-RDMA depends
| |  on the connection's inline threshold. Today that threshold is
| |  typically 1024 bytes, making the maximum message size 996 bytes.
| |
| |  The Linux server's CREATE_SESSION operation checks that the size
| |  of callback Calls can be as large as 1044 bytes, to accommodate
| |  RPCSEC_GSS. Thus CREATE_SESSION fails if a client advertises the
| |  true message size maximum of 996 bytes.
| |
| |  But the server's backchannel currently does not support RPCSEC_GSS.
| |  The actual maximum size it needs is much smaller. It is safe to
| |  reduce the limit to enable NFSv4.1 on RDMA backchannel operation.
| |
| |  Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| |  Signed-off-by: J. Bruce Fields <bfields@redhat.com>
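The numbers in this message can be checked with a little arithmetic. The sketch below assumes that the 28-byte gap between the 1024-byte inline threshold and the 996-byte maximum is the minimal RPC-over-RDMA header (seven 4-byte XDR words); the constant and function names are hypothetical and chosen only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed: minimal RPC-over-RDMA header is seven 4-byte XDR words. */
#define TOY_RPCRDMA_HDRLEN_MIN ((size_t)(7 * 4))

/* Largest backchannel RPC message that fits inline at a given threshold. */
static size_t backchannel_max_msg(size_t inline_threshold)
{
    assert(inline_threshold >= TOY_RPCRDMA_HDRLEN_MIN);
    return inline_threshold - TOY_RPCRDMA_HDRLEN_MIN;
}

/* Models the server-side check: the client's limit must cover what the server wants. */
static bool callback_size_ok(size_t client_max, size_t server_required)
{
    return client_max >= server_required;
}
```

With a 1024-byte threshold, backchannel_max_msg() yields 996, and callback_size_ok(996, 1044) is false, which is the CREATE_SESSION failure described above; lowering the server's requirement below 996 makes the check pass.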
| * svcrdma: Do not send Write chunk XDR pad with inline content (Chuck Lever, 2016-03-01, 2 files, -6/+18)
| |
| |  The NFS server's XDR encoders add an XDR pad for content in the
| |  xdr_buf page list at the beginning of the xdr_buf's tail buffer.
| |
| |  On RDMA transports, Write chunks are sent separately and without an
| |  XDR pad.
| |
| |  If a Write chunk is being sent, strip off the pad in the tail buffer
| |  so that inline content following the Write chunk remains XDR-aligned
| |  when it is sent to the client.
| |
| |  BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=294
| |  Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| |  Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * svcrdma: Do not write xdr_buf::tail in a Write chunk (Chuck Lever, 2016-03-01, 1 file, -3/+8)
| |
| |  When the Linux NFS server writes an odd-length data item into a
| |  Write chunk, it finishes with XDR pad bytes. If the data item is
| |  smaller than the Write chunk, the pad bytes are written at the end
| |  of the data item, but still inside the chunk (ie, in the
| |  application's buffer). Since this is direct data placement, that
| |  exposes the pad bytes.
| |
| |  XDR pad bytes are inserted in order to preserve the XDR alignment
| |  of the next XDR data item in an XDR stream. But Write chunks do
| |  not appear in the payload XDR stream, and only one data item is
| |  allowed in each chunk. Thus XDR padding is not needed in a Write
| |  chunk.
| |
| |  With NFSv4, the Linux NFS server places the results of any
| |  operations that follow an NFSv4 READ or READLINK in the xdr_buf's
| |  tail. Those results also should never be sent as a part of a Write
| |  chunk.
| |
| |  The current logic in send_write_chunks() appears to assume that
| |  the xdr_buf's tail contains only pad bytes (ie, NFSv3). The server
| |  should write only the contents of the xdr_buf's page list in a
| |  Write chunk. If there's more than an XDR pad in the tail, that
| |  needs to go inline or in the Reply chunk.
| |
| |  BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=294
| |  Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| |  Signed-off-by: J. Bruce Fields <bfields@redhat.com>
| * svcrdma: Find client-provided write and reply chunks once per reply (Chuck Lever, 2016-03-01, 1 file, -44/+36)
| |
| |  The client provides the location of Write chunks into which the
| |  server writes bulk payload. The client provides these when the
| |  Upper Layer Protocol wants direct data placement and the Binding
| |  allows it. (For NFS, this is READ and READLINK operations).
| |
| |  The client also provides the location of a Reply chunk into which
| |  the server writes the non-bulk part of an RPC reply. The client
| |  provides this chunk whenever it believes the reply can be larger
| |  than its receive buffers.
| |
| |  The server then uses the presence of these chunks to determine how
| |  it will form its reply message.
| |
| |  svc_rdma_sendto() was looking for Write and Reply chunks multiple
| |  times for every reply message. It would be more efficient to do it
| |  just once.
| |
| |  Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| |  Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* | Merge tag 'nfs-for-4.6-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs (Linus Torvalds, 2016-03-22, 14 files, -346/+1020)
|\ \
| | |  Pull NFS client updates from Trond Myklebust:
| | |   "Highlights include:
| | |
| | |    Features:
| | |     - Add support for multiple NFSv4.1 callbacks in flight
| | |     - Initial patchset for RPC multipath support
| | |     - Adapt RPC/RDMA to use the new completion queue API
| | |
| | |    Bugfixes and cleanups:
| | |     - nfs4: nfs4_ff_layout_prepare_ds should return NULL if connection failed
| | |     - Cleanups to remove nfs_inode_dio_wait and nfs4_file_fsync
| | |     - Fix RPC/RDMA credit accounting
| | |     - Properly handle RDMA_ERROR replies
| | |     - xprtrdma: Do not wait if ib_post_send() fails
| | |     - xprtrdma: Segment head and tail XDR buffers on page boundaries
| | |     - xprtrdma cleanups for dprintk, physical_op_map and unused macros"
| | |
| | |  * tag 'nfs-for-4.6-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (35 commits)
| | |    nfs/blocklayout: make sure making a aligned read request
| | |    nfs4: nfs4_ff_layout_prepare_ds should return NULL if connection failed
| | |    nfs: remove nfs_inode_dio_wait
| | |    nfs: remove nfs4_file_fsync
| | |    xprtrdma: Use new CQ API for RPC-over-RDMA client send CQs
| | |    xprtrdma: Use an anonymous union in struct rpcrdma_mw
| | |    xprtrdma: Use new CQ API for RPC-over-RDMA client receive CQs
| | |    xprtrdma: Serialize credit accounting again
| | |    xprtrdma: Properly handle RDMA_ERROR replies
| | |    rpcrdma: Add RPCRDMA_HDRLEN_ERR
| | |    xprtrdma: Do not wait if ib_post_send() fails
| | |    xprtrdma: Segment head and tail XDR buffers on page boundaries
| | |    xprtrdma: Clean up dprintk format string containing a newline
| | |    xprtrdma: Clean up physical_op_map()
| | |    xprtrdma: Clean up unused RPCRDMA_INLINE_PAD_THRESH macro
| | |    NFS add callback_ops to nfs4_proc_bind_conn_to_session_callback
| | |    pnfs/NFSv4.1: Add multipath capabilities to pNFS flexfiles servers over NFSv3
| | |    SUNRPC: Allow addition of new transports to a struct rpc_clnt
| | |    NFSv4.1: nfs4_proc_bind_conn_to_session must iterate over all connections
| | |    SUNRPC: Make NFS swap work with multipath
| | |    ...
| * \ Merge tag 'nfs-rdma-4.6-1' of git://git.linux-nfs.org/projects/anna/nfs-rdma (Trond Myklebust, 2016-03-16, 7 files, -247/+253)
| |\ \
| | | |  NFS: NFSoRDMA Client Side Changes
| | | |
| | | |  These patches include several bugfixes and cleanups for the NFSoRDMA
| | | |  client. This includes bugfixes for NFS v4.1, proper RDMA_ERROR handling,
| | | |  and fixes from the recent workqueue switchover. These patches also switch
| | | |  xprtrdma to use the new CQ API.
| | | |
| | | |  Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | | |
| | | |  * tag 'nfs-rdma-4.6-1' of git://git.linux-nfs.org/projects/anna/nfs-rdma: (787 commits)
| | | |    xprtrdma: Use new CQ API for RPC-over-RDMA client send CQs
| | | |    xprtrdma: Use an anonymous union in struct rpcrdma_mw
| | | |    xprtrdma: Use new CQ API for RPC-over-RDMA client receive CQs
| | | |    xprtrdma: Serialize credit accounting again
| | | |    xprtrdma: Properly handle RDMA_ERROR replies
| | | |    rpcrdma: Add RPCRDMA_HDRLEN_ERR
| | | |    xprtrdma: Do not wait if ib_post_send() fails
| | | |    xprtrdma: Segment head and tail XDR buffers on page boundaries
| | | |    xprtrdma: Clean up dprintk format string containing a newline
| | | |    xprtrdma: Clean up physical_op_map()
| | | |    xprtrdma: Clean up unused RPCRDMA_INLINE_PAD_THRESH macro
| | * | xprtrdma: Use new CQ API for RPC-over-RDMA client send CQsChuck Lever2016-03-143-125/+91
| | | | Calling ib_poll_cq() to sort through WCs during a completion is a common pattern amongst RDMA consumers. Since commit 14d3a3b2498e ("IB: add a proper completion queue abstraction"), WC sorting can be handled by the IB core.
| | | |
| | | | By converting to this new API, xprtrdma is made a better neighbor to other RDMA consumers, as it allows the core to schedule the delivery of completions more fairly amongst all active consumers.
| | | |
| | | | Because each ib_cqe carries a pointer to a completion method, the core can now post its own operations on a consumer's QP, and handle the completions itself, without changes to the consumer.
| | | |
| | | | Send completions were previously handled entirely in the completion upcall handler (ie, deferring to a process context is unneeded). Thus IB_POLL_SOFTIRQ is a direct replacement for the current xprtrdma send code path.
| | | |
| | | | Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| | | | Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com>
| | | | Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
| | | | Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
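The "pointer to a completion method" idea in the message above can be illustrated with a small userspace model. This is only a sketch: every `model_*` name is hypothetical, the structures are simplified stand-ins rather than the kernel's definitions, and the dispatch loop merely imitates what a core polling loop might do. The point it tries to show is that the dispatcher needs no knowledge of the consumer's request type, because each completion entry carries its own handler.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model only: these types imitate the idea, not the kernel's layout. */
struct model_wc;

struct model_cqe {
    /* Completion method supplied by the consumer (hypothetical signature). */
    void (*done)(struct model_cqe *cqe, const struct model_wc *wc);
};

struct model_wc {
    struct model_cqe *cqe;  /* completion entry this work completion belongs to */
    int status;             /* 0 means success in this model */
};

/* A consumer embeds the completion entry in its own request structure. */
struct model_send_req {
    int completed;
    int last_status;
    struct model_cqe cqe;
};

/* Recover the enclosing structure from a pointer to one of its members. */
#define model_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Consumer-side handler: finds its request from the embedded entry. */
static void model_send_done(struct model_cqe *cqe, const struct model_wc *wc)
{
    struct model_send_req *req =
        model_container_of(cqe, struct model_send_req, cqe);

    req->completed = 1;
    req->last_status = wc->status;
}

/*
 * Stand-in for a core polling loop: it calls each entry's handler and
 * returns how many completions it dispatched.
 */
static int model_core_dispatch(const struct model_wc *wcs, int n)
{
    int handled = 0;

    for (int i = 0; i < n; i++) {
        if (wcs[i].cqe && wcs[i].cqe->done) {
            wcs[i].cqe->done(wcs[i].cqe, &wcs[i]);
            handled++;
        }
    }
    return handled;
}
```

Because the handler travels with the entry, a second consumer (or the core itself) could share the same dispatch loop with a different `done` function, without any change to `model_core_dispatch()`.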
| | * | xprtrdma: Use an anonymous union in struct rpcrdma_mwChuck Lever2016-03-143-36/+36
| | | | Clean up: Make code more readable.
| | | |
| | | | Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| | | | Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com>
| | | | Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
| | | | Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * | xprtrdma: Use new CQ API for RPC-over-RDMA client receive CQsChuck Lever2016-03-142-58/+21
| | | | Calling ib_poll_cq() to sort through WCs during a completion is a common pattern amongst RDMA consumers. Since commit 14d3a3b2498e ("IB: add a proper completion queue abstraction"), WC sorting can be handled by the IB core.
| | | |
| | | | By converting to this new API, xprtrdma is made a better neighbor to other RDMA consumers, as it allows the core to schedule the delivery of completions more fairly amongst all active consumers.
| | | |
| | | | Because each ib_cqe carries a pointer to a completion method, the core can now post its own operations on a consumer's QP, and handle the completions itself, without changes to the consumer.
| | | |
| | | | xprtrdma's reply processing is already handled in a work queue, but there is some initial order-dependent processing that is done in the soft IRQ context before a work item is scheduled. IB_POLL_SOFTIRQ is a direct replacement for the current xprtrdma receive code path.
| | | |
| | | | Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| | | | Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com>
| | | | Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
| | | | Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * | xprtrdma: Serialize credit accounting againChuck Lever2016-03-143-9/+28
| | | | Commit fe97b47cd623 ("xprtrdma: Use workqueue to process RPC/RDMA replies") replaced the reply tasklet with a workqueue that allows RPC replies to be processed in parallel. Thus the credit values in RPC-over-RDMA replies can be applied in a different order than in which the server sent them.
| | | |
| | | | To fix this, revert commit eba8ff660b2d ("xprtrdma: Move credit update to RPC reply handler"). Reverting is done by hand to accommodate code changes that have occurred since then.
| | | |
| | | | Fixes: fe97b47cd623 ("xprtrdma: Use workqueue to process . . .")
| | | | Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| | | | Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
| | | | Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * | xprtrdma: Properly handle RDMA_ERROR repliesChuck Lever2016-03-141-8/+43
| | | | These are shorter than RPCRDMA_HDRLEN_MIN, and they need to complete the waiting RPC.
| | | |
| | | | Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| | | | Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
| | | | Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * | xprtrdma: Do not wait if ib_post_send() failsChuck Lever2016-03-141-1/+5
| | | | If ib_post_send() in ro_unmap_sync() fails, the WRs have not been posted, no completions will fire, and wait_for_completion() will wait forever. Skip the wait in that case.
| | | |
| | | | To ensure the MRs are invalid, disconnect.
| | | |
| | | | Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| | | | Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
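The control flow described in the message above (wait only if the post succeeded, otherwise disconnect) can be modelled in plain C. This is a hedged sketch, not the patch itself: the `model_*` names, the function-pointer "poster", and the counter that stands in for wait_for_completion() are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model of the transport state touched by the unmap path. */
struct model_xprt {
    bool disconnected;  /* set when the model forces a disconnect */
    int  waits;         /* number of times the model "blocked" for a completion */
};

/* Hypothetical poster: returns 0 on success, negative on failure. */
typedef int (*model_post_fn)(void *ctx);

/*
 * Post the invalidation work and wait for it, unless the post failed.
 * On failure nothing was queued, so no completion can ever arrive;
 * waiting would hang, and the model disconnects instead.
 */
static int model_unmap_sync(struct model_xprt *x, model_post_fn post, void *ctx)
{
    int rc = post(ctx);

    if (rc) {
        x->disconnected = true;  /* ensures registrations are torn down */
        return rc;               /* skip the wait entirely */
    }
    x->waits++;                  /* stands in for wait_for_completion() */
    return 0;
}

/* Two trivial posters used to exercise both branches. */
static int model_post_ok(void *ctx)   { (void)ctx; return 0; }
static int model_post_fail(void *ctx) { (void)ctx; return -1; }
```

Calling `model_unmap_sync()` with `model_post_fail` leaves `waits` unchanged and sets `disconnected`, which is the behaviour the commit message asks for.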
| | * | xprtrdma: Segment head and tail XDR buffers on page boundariesChuck Lever2016-03-141-10/+32
| | | | A single memory allocation is used for the pair of buffers wherein the RPC client builds an RPC call message and decodes its matching reply. These buffers are sized based on the maximum possible size of the RPC call and reply messages for the operation in progress.
| | | |
| | | | This means that as the call buffer increases in size, the start of the reply buffer is pushed farther into the memory allocation.
| | | |
| | | | RPC requests are growing in size. It used to be that both the call and reply buffers fit inside a single page.
| | | |
| | | | But these days, thanks to NFSv4 (and especially security labels in NFSv4.2) the maximum call and reply sizes are large. NFSv4.0 OPEN, for example, now requires a 6KB allocation for a pair of call and reply buffers, and NFSv4 LOOKUP is not far behind.
| | | |
| | | | As the maximum size of a call increases, the reply buffer is pushed far enough into the buffer's memory allocation that a page boundary can appear in the middle of it.
| | | |
| | | | When the maximum possible reply size is larger than the client's RDMA receive buffers (currently 1KB), the client has to register a Reply chunk for the server to RDMA Write the reply into.
| | | |
| | | | The logic in rpcrdma_convert_iovs() assumes that xdr_buf head and tail buffers would always be contained on a single page. It supplies just one segment for the head and one for the tail.
| | | |
| | | | FMR, for example, registers up to a page boundary (only a portion of the reply buffer in the OPEN case above). But without additional segments, it doesn't register the rest of the buffer.
| | | |
| | | | When the server tries to write the OPEN reply, the RDMA Write fails with a remote access error since the client registered only part of the Reply chunk.
| | | |
| | | | rpcrdma_convert_iovs() must split the XDR buffer into multiple segments, each of which is guaranteed not to contain a page boundary. That way fmr_op_map is given the proper number of segments to register the whole reply buffer.
| | | |
| | | | Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| | | | Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com>
| | | | Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
| | | | Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
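The splitting rule in the message above, where no segment may contain a page boundary, can be sketched in standalone C. This is not the code from rpcrdma_convert_iovs(); the `sketch_*` names and the fixed 4096-byte page size are assumptions chosen for illustration. The function walks a byte range and cuts it wherever the next page begins.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size, for illustration only */

struct sketch_seg {
    uintptr_t addr;  /* start address of the segment */
    size_t    len;   /* length in bytes; never crosses a page boundary */
};

/*
 * Split [addr, addr + len) into segments that each stay within one page.
 * Returns the number of segments written, or -1 if max_segs is too small.
 */
static int sketch_split_on_pages(uintptr_t addr, size_t len,
                                 struct sketch_seg *segs, int max_segs)
{
    int n = 0;

    assert(segs != NULL);
    while (len > 0) {
        /* Bytes left before the next page boundary. */
        size_t room = SKETCH_PAGE_SIZE - (addr & (SKETCH_PAGE_SIZE - 1));
        size_t chunk = len < room ? len : room;

        if (n == max_segs)
            return -1;
        segs[n].addr = addr;
        segs[n].len = chunk;
        n++;
        addr += chunk;
        len -= chunk;
    }
    return n;
}
```

As a made-up example, a 3072-byte reply buffer that starts 3072 bytes into a page-aligned allocation straddles the first page boundary, so the sketch yields two segments of 1024 and 2048 bytes. A single-segment approach would describe only the first 1024 bytes, which mirrors the partial registration described above.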
| | * | xprtrdma: Clean up dprintk format string containing a newlineChuck Lever2016-03-141-4/+2
| | | | Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| | | | Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
| | | | Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| | * | xprtrdma: Clean up physical_op_map()Chuck Lever2016-03-141-1/+0
| | | | physical_op_unmap{_sync} don't use mr_nsegs, so don't bother to set it in physical_op_map.
| | | |
| | | | Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| | | | Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
| | | | Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| * | | Merge branch 'multipath'Trond Myklebust2016-02-228-100/+768
| |\ \ \
| | | | | * multipath:
| | | | |   NFS add callback_ops to nfs4_proc_bind_conn_to_session_callback
| | | | |   pnfs/NFSv4.1: Add multipath capabilities to pNFS flexfiles servers over NFSv3
| | | | |   SUNRPC: Allow addition of new transports to a struct rpc_clnt
| | | | |   NFSv4.1: nfs4_proc_bind_conn_to_session must iterate over all connections
| | | | |   SUNRPC: Make NFS swap work with multipath
| | | | |   SUNRPC: Add a helper to apply a function to all the rpc_clnt's transports
| | | | |   SUNRPC: Allow caller to specify the transport to use
| | | | |   SUNRPC: Use the multipath iterator to assign a transport to each task
| | | | |   SUNRPC: Make rpc_clnt store the multipath iterators
| | | | |   SUNRPC: Add a structure to track multiple transports
| | | | |   SUNRPC: Make freeing of struct xprt rcu-safe
| | | | |   SUNRPC: Uninline xprt_get(); It isn't performance critical.
| | | | |   SUNRPC: Reorder rpc_task to put waitqueue related info in same cachelines
| | | | |   SUNRPC: Remove unused function rpc_task_reset_client