| Commit message (Collapse) | Author | Files | Lines |
|
Clean up: The ro_unmap method is no longer used.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
There needs to be a safe method of releasing registered memory
resources when an RPC terminates. Safe can mean a number of things:
+ Doesn't have to sleep
+ Doesn't rely on having a QP in RTS
ro_unmap_safe will be that safe method. It can be used in cases
where synchronous memory invalidation can deadlock, or needs to have
an active QP.
The important case is fencing an RPC's memory regions after it is
signaled (^C) and before it exits. If this is not done, there is a
window where the server can write an RPC reply into memory that the
client has released and re-used for some other purpose.
Note that this is a full solution for FRWR, but FMR and physical
still have some gaps where a particularly bad server can wreak
some havoc on the client. These gaps are not made worse by this
patch and are expected to be exceptionally rare and timing-based.
They are noted in documenting comments.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
Separate the DMA unmap operation from freeing the MW. In a
subsequent patch they will not always be done at the same time,
and they are not related operations (except by order; freeing
the MW must be the last step during invalidation).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
In a subsequent patch, the fr_xprt and fr_worker fields will be
needed by another memory registration mode. Move them into the
generic rpcrdma_mw structure that wraps struct rpcrdma_frmr.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
Maintain the order of invalidation and DMA unmapping when doing
a background MR reset.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
frwr_op_unmap_sync() is now invoked in a workqueue context, the same
as __frwr_queue_recovery(). There's no need to defer MR reset if
posting LOCAL_INV MRs fails.
This means that even when ib_post_send() fails (which should occur
very rarely) the invalidation and DMA unmapping steps are still done
in the correct order.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
Move the the I/O direction field from rpcrdma_mr_seg into the
rpcrdma_frmr.
This makes it possible to DMA-unmap the frwr long after an RPC has
exited and its rpcrdma_mr_seg array has been released and re-used.
This might occur if an RPC times out while waiting for a new
connection to be established.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
Clean up: Follow same naming convention as other fields in struct
rpcrdma_frwr.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
Clean up: Replace rpcrdma_flush_cqs() and rpcrdma_clean_cqs() with
the new ib_drain_qp() API.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-By: Leon Romanovsky <leonro@mellanox.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
rpcrdma_create_chunks() has been replaced, and can be removed.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.
In fact, this assumption is asserted in the code:
if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
}
But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.
Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.
The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.
Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
Update documenting comments to reflect code changes over the past
year.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
Avoid the latency and interrupt overhead of registering a Write
chunk when handling NFS READ requests of a few hundred bytes or
less.
This change does not interoperate with Linux NFS/RDMA servers
that do not have commit 9d11b51ce7c1 ('svcrdma: Fix send_reply()
scatter/gather set-up'). Commit 9d11b51ce7c1 was introduced in v4.3,
and is included in 4.2.y, 4.1.y, and 3.18.y.
Oracle bug 22925946 has been filed to request that the above fix
be included in the Oracle Linux UEK4 NFS/RDMA server.
Red Hat bugzillas 1327280 and 1327554 have been filed to request
that RHEL NFS/RDMA server backports include the above fix.
Workaround: Replace the "proto=rdma,port=20049" mount options
with "proto=tcp" until commit 9d11b51ce7c1 is applied to your
NFS server.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
When deciding whether to send a Call inline, rpcrdma_marshal_req
doesn't take into account header bytes consumed by chunk lists.
This results in Call messages on the wire that are sometimes larger
than the inline threshold.
Likewise, when a Write list or Reply chunk is in play, the server's
reply has to emit an RDMA Send that includes a larger-than-minimal
RPC-over-RDMA header.
The actual size of a Call message cannot be estimated until after
the chunk lists have been registered. Thus the size of each
RPC-over-RDMA header can be estimated only after chunks are
registered; but the decision to register chunks is based on the size
of that header. Chicken, meet egg.
The best a client can do is estimate header size based on the
largest header that might occur, and then ensure that inline content
is always smaller than that.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
Send buffer space is shared between the RPC-over-RDMA header and
an RPC message. A large RPC-over-RDMA header means less space is
available for the associated RPC message, which then has to be
moved via an RDMA Read or Write.
As more segments are added to the chunk lists, the header increases
in size. Typical modern hardware needs only a few segments to
convey the maximum payload size, but some devices and registration
modes may need a lot of segments to convey data payload. Sometimes
so many are needed that the remaining space in the Send buffer is
not enough for the RPC message. Sending such a message usually
fails.
To ensure a transport can always make forward progress, cap the
number of RDMA segments that are allowed in chunk lists. This
prevents less-capable devices and memory registrations from
consuming a large portion of the Send buffer by reducing the
maximum data payload that can be conveyed with such devices.
For now I choose an arbitrary maximum of 8 RDMA segments. This
allows a maximum size RPC-over-RDMA header to fit nicely in the
current 1024 byte inline threshold with over 700 bytes remaining
for an inline RPC message.
The current maximum data payload of NFS READ or WRITE requests is
one megabyte. To convey that payload on a client with 4KB pages,
each chunk segment would need to handle 32 or more data pages. This
is well within the capabilities of FMR. For physical registration,
the maximum payload size on platforms with 4KB pages is reduced to
32KB.
For FRWR, a device's maximum page list depth would need to be at
least 34 to support the maximum 1MB payload. A device with a smaller
maximum page list depth means the maximum data payload is reduced
when using that device.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
Currently the sysctls that allow setting the inline threshold allow
any value to be set.
Small values only make the transport run slower. The default 1KB
setting is as low as is reasonable. And the logic that decides how
to divide a Send buffer between RPC-over-RDMA header and RPC message
assumes (but does not check) that the lower bound is not crazy (say,
57 bytes).
Send and receive buffers share a page with some control information.
Values larger than about 3KB can't be supported, currently.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
RPC-over-RDMA transports have a limit on how large a backward
direction (backchannel) RPC message can be. Ensure that the NFSv4.x
CREATE_SESSION operation advertises this limit to servers.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
Commit 176e21ee2ec8 ("SUNRPC: Support for RPC over AF_LOCAL
transports") added a 5-character netid, but did not bump
RPCBIND_MAXNETIDLEN from 4 to 5.
Fixes: 176e21ee2ec8 ("SUNRPC: Support for RPC over AF_LOCAL ...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
RFC 5666: The "rdma" netid is to be used when IPv4 addressing
is employed by the underlying transport, and "rdma6" for IPv6
addressing.
Add mount -o proto=rdma6 option to support NFS/RDMA IPv6 addressing.
Changes from v2:
- Integrated comments from Chuck Level, Anna Schumaker, Trodt Myklebust
- Add a little more to the patch description to describe NFS/RDMA
IPv6 suggested by Chuck Level and Anna Schumaker
- Removed duplicated rdma6 define
- Remove Opt_xprt_rdma mountfamily since it doesn't support
Signed-off-by: Shirley Ma <shirley.ma@oracle.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
OPEN_CREATE with EXCLUSIVE4_1 sends initial file permission.
Ignoring fact, that server have indicated that file mod is set, client
will send yet another SETATTR request, but, as mode is already set,
new SETATTR will be empty. This is not a problem, nevertheless
an extra roundtrip and slow open on high latency networks.
This change is aims to skip extra setattr after open if there are
no attributes to be set.
Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
This adds the copy_range file_ops function pointer used by the
sys_copy_range() function call. This patch only implements sync copies,
so if an async copy happens we decode the stateid and ignore it.
Signed-off-by: Anna Schumaker <bjschuma@netapp.com>
|
|
Copy will use this to set up a commit request for a generic range. I
don't want to allocate a new pagecache entry for the file, so I needed
to change parts of the commit path to handle requests with a null
wb_page.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
Commit 80f9642724af5 ("NFSv4.x: Enforce the ca_maxreponsesize_cached
on the back channel") causes an oops when it receives a callback with
cachethis=yes.
[ 109.667378] BUG: unable to handle kernel NULL pointer dereference at 00000000000002c8
[ 109.669476] IP: [<ffffffffa08a3e68>] nfs4_callback_compound+0x4f8/0x690 [nfsv4]
[ 109.671216] PGD 0
[ 109.671736] Oops: 0000 [#1] SMP
[ 109.705427] CPU: 1 PID: 3579 Comm: nfsv4.1-svc Not tainted 4.5.0-rc1+ #1
[ 109.706987] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/20/2014
[ 109.709468] task: ffff8800b4408000 ti: ffff88008448c000 task.ti: ffff88008448c000
[ 109.711207] RIP: 0010:[<ffffffffa08a3e68>] [<ffffffffa08a3e68>] nfs4_callback_compound+0x4f8/0x690 [nfsv4]
[ 109.713521] RSP: 0018:ffff88008448fca0 EFLAGS: 00010286
[ 109.714762] RAX: ffff880081ee202c RBX: ffff8800b7b5b600 RCX: 0000000000000001
[ 109.716427] RDX: 0000000000000008 RSI: 0000000000000008 RDI: 0000000000000000
[ 109.718091] RBP: ffff88008448fda8 R08: 0000000000000000 R09: 000000000b000000
[ 109.719757] R10: ffff880137786000 R11: ffff8800b7b5b600 R12: 0000000001000000
[ 109.721415] R13: 0000000000000002 R14: 0000000053270000 R15: 000000000000000b
[ 109.723061] FS: 0000000000000000(0000) GS:ffff880139640000(0000) knlGS:0000000000000000
[ 109.724931] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 109.726278] CR2: 00000000000002c8 CR3: 0000000034d50000 CR4: 00000000001406e0
[ 109.727972] Stack:
[ 109.728465] ffff880081ee202c ffff880081ee201c 000000008448fcc0 ffff8800baccb800
[ 109.730349] ffff8800baccc800 ffffffffa08d0380 0000000000000000 0000000000000000
[ 109.732211] ffff8800b7b5b600 0000000000000001 ffffffff81d073c0 ffff880081ee3090
[ 109.734056] Call Trace:
[ 109.734657] [<ffffffffa03795d4>] svc_process_common+0x5c4/0x6c0 [sunrpc]
[ 109.736267] [<ffffffffa0379a4c>] bc_svc_process+0x1fc/0x360 [sunrpc]
[ 109.737775] [<ffffffffa08a2c2c>] nfs41_callback_svc+0x10c/0x1d0 [nfsv4]
[ 109.739335] [<ffffffff810cb380>] ? prepare_to_wait_event+0xf0/0xf0
[ 109.740799] [<ffffffffa08a2b20>] ? nfs4_callback_svc+0x50/0x50 [nfsv4]
[ 109.742349] [<ffffffff810a6998>] kthread+0xd8/0xf0
[ 109.743495] [<ffffffff810a68c0>] ? kthread_park+0x60/0x60
[ 109.744776] [<ffffffff816abc4f>] ret_from_fork+0x3f/0x70
[ 109.746037] [<ffffffff810a68c0>] ? kthread_park+0x60/0x60
[ 109.747324] Code: cc 45 31 f6 48 8b 85 00 ff ff ff 44 89 30 48 8b 85 f8 fe ff ff 44 89 20 48 8b 9d 38 ff ff ff 48 8b bd 30 ff ff ff 48 85 db 74 4c <4c> 8b af c8 02 00 00 4d 8d a5 08 02 00 00 49 81 c5 98 02 00 00
[ 109.754361] RIP [<ffffffffa08a3e68>] nfs4_callback_compound+0x4f8/0x690 [nfsv4]
[ 109.756123] RSP <ffff88008448fca0>
[ 109.756951] CR2: 00000000000002c8
[ 109.757738] ---[ end trace 2b8555511ab5dfb4 ]---
[ 109.758819] Kernel panic - not syncing: Fatal exception
[ 109.760126] Kernel Offset: disabled
[ 118.938934] ---[ end Kernel panic - not syncing: Fatal exception
It doesn't unlock the table nor does it set the cps->clp pointer which
is later needed by nfs4_cb_free_slot().
Fixes: 80f9642724af5 ("NFSv4.x: Enforce the ca_maxresponsesize_cached ...")
CC: stable@vger.kernel.org
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
There's no guarantee that an IP address in a different network namespace
actually represents the same endpoint.
Also, if we allow unprivileged nfs mounts some day then this might allow
an unprivileged user in another network namespace to misdirect somebody
else's nfs mounts.
If sharing between containers is really what's wanted then that could
still be arranged explicitly, for example with bind mounts.
Reported-by: "Eric W. Biederman" <ebiederm@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
At Connectathon 2016, we found that recent upstream Linux clients
would occasionally send a LOCK operation with a zero stateid. This
appeared to happen in close proximity to another thread returning
a delegation before unlinking the same file while it remained open.
Earlier, the client received a write delegation on this file and
returned the open stateid. Now, as it is getting ready to unlink the
file, it returns the write delegation. But there is still an open
file descriptor on that file, so the client must OPEN the file
again before it returns the delegation.
Since commit 24311f884189 ('NFSv4: Recovery of recalled read
delegations is broken'), nfs_open_delegation_recall() clears the
NFS_DELEGATED_STATE flag _before_ it sends the OPEN. This allows a
racing LOCK on the same inode to be put on the wire before the OPEN
operation has returned a valid open stateid.
To eliminate this race, serialize delegation return with the
acquisition of a file lock on the same file. Adopt the same approach
as is used in the unlock path.
This patch also eliminates a similar race seen when sending a LOCK
operation at the same time as returning a delegation on the same file.
Fixes: 24311f884189 ('NFSv4: Recovery of recalled read ... ')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
[Anna: Add sentence about LOCK / delegation race]
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
A mirror can be shared between multiple layouts, even with different
iomodes. That makes stats gathering simpler, but it causes a problem
when we get different creds in READ vs. RW layouts.
The current code drops the newer credentials onto the floor when this
occurs. That's problematic when you fetch a READ layout first, and then
a RW. If the READ layout doesn't have the correct creds to do a write,
then writes will fail.
We could just overwrite the READ credentials with the RW ones, but that
would break the ability for the server to fence the layout for reads if
things go awry. We need to be able to revert to the earlier READ creds
if the RW layout is returned afterward.
The simplest fix is to just keep two sets of creds per mirror. One for
READ layouts and one for RW, and then use the appropriate set depending
on the iomode of the layout segment.
Also fix up some RCU nits that sparse found.
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
We're just as likely to have allocation problems here as we would if we
delay looking up the credential like we currently do. Fix the code to
get a rpc_cred reference early, as soon as the mirror is set up.
This allows us to eliminate the mirror early if there is a problem
getting an rpc credential. This also allows us to drop the uid/gid
from the layout_mirror struct as well.
In the event that we find an existing mirror where this one would go, we
swap in the new creds unconditionally, and drop the reference to the old
one.
Note that the old ff_layout_update_mirror_cred function wouldn't set
this pointer unless the DS version was 3, but we don't know what the DS
version is at this point. I'm a little unclear on why it did that as you
still need creds to talk to v4 servers as well. I have the code set
it regardless of the DS version here.
Also note the change to using generic creds instead of calling
lookup_cred directly. With that change, we also need to populate the
group_info pointer in the acred as some functions expect that to never
be NULL. Instead of allocating one every time however, we can allocate
one when the module is loaded and share it since the group_info is
refcounted.
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
In later patches, we're going to want to allow the creds to be updated
when we get a new layout with updated creds. Have this function take
a reference to the cred that is later put once the call has been
dispatched.
Also, prepare for this change by ensuring we follow RCU rules when
getting a reference to the cred as well.
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
All the callers already call that function before calling into here,
so it ends up being a no-op anyway.
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
Sometimes we might have a RCU managed credential pointer and don't want
to use locking to handle it. Add a function that will take a reference
to the cred iff the refcount is not already zero. Callers can dereference
the pointer under the rcu_read_lock and use that function to take a
reference only if the cred is not on its way to destruction.
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
Add function rpc_lookup_generic_cred, which allows lookups of a generic
credential that's not current_cred().
[jlayton: add gfp_t parm]
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
We need to be able to call the generic_cred creator from different
contexts. Add a gfp_t parm to the crcreate operation and to
rpcauth_lookup_credcache. For now, we just push the gfp_t parms up
one level to the *_lookup_cred functions.
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
An xdr_buf with head[0].iov_len = 0 and page_len = 0 will cause
xdr_init_decode() to incorrectly setup the xdr_stream. Specifically,
xdr->end is never initialized.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
Commit ea2cf22 created nfs_commit_info and saved &inode->i_lock inside
this NFS specific structure. This obscures the usage of i_lock.
Instead, save struct inode * so later it's clear the spinlock taken is
i_lock.
Should be no functional change.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
This will pop a warning if we count too many good bytes.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
Like other resend paths, mark the (old) hdr as NFS_IOHDR_REDO. This
ensures the hdr completion function will not count the (old) hdr
as good bytes.
Also, vector the error back through the hdr->task.tk_status like other
retry calls.
This fixes a bug with the FlexFiles layout where libaio was reporting more
bytes read than requested.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
|
|
Do not load one entry beyond the end of the syscall table when the
syscall number of a traced process equals to __NR_Linux_syscalls.
Similar bug with regular processes was fixed by commit 3bb457af4fa8
("[PARISC] Fix bug when syscall nr is __NR_Linux_syscalls").
This bug was found by strace test suite.
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
Acked-by: Helge Deller <deller@gmx.de>
Signed-off-by: Helge Deller <deller@gmx.de>
|
|
Currently we read the tsc radio: ratio = (MSR_PLATFORM_INFO >> 8) & 0x1f;
Thus we get bit 8-12 of MSR_PLATFORM_INFO, however according to the SDM
(35.5), the ratio bits are bit 8-15.
Ignoring the upper bits can result in an incorrect tsc ratio, which causes the
TSC calibration and the Local APIC timer frequency to be incorrect.
Fix this problem by masking 0xff instead.
[ tglx: Massaged changelog ]
Fixes: 7da7c1561366 "x86, tsc: Add static (MSR) TSC calibration on Intel Atom SoCs"
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: stable@vger.kernel.org
Cc: Bin Gao <bin.gao@intel.com>
Cc: Len Brown <lenb@kernel.org>
Link: http://lkml.kernel.org/r/1462505619-5516-1-git-send-email-yu.c.chen@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Apparently patchwork ended up truncating the full name.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This is another attempt to avoid a regression in wwn_to_u64() after that
started using get_unaligned_be64(), which in turn ran into a bug on
gcc-4.9 through 6.1.
The regression got introduced due to the combination of two separate
workarounds (commits e3bde9568d99: "include/linux/unaligned: force
inlining of byteswap operations" and ef3fb2422ffe: "scsi: fc: use
get/put_unaligned64 for wwn access") that each try to sidestep distinct
problems with gcc behavior (code growth and increased stack usage).
Unfortunately after both have been applied, a more serious gcc bug has
been uncovered, leading to incorrect object code that discards part of a
function and causes undefined behavior.
As part of this problem is how __builtin_constant_p gets evaluated on an
argument passed by reference into an inline function, this avoids the
use of __builtin_constant_p() for all architectures that set
CONFIG_ARCH_USE_BUILTIN_BSWAP. Most architectures do not set
ARCH_SUPPORTS_OPTIMIZED_INLINING, which means they probably do not
suffer from the problem in the qla2xxx driver, but they might still run
into it elsewhere.
Both of the original workarounds were only merged in the 4.6 kernel, and
the bug that is fixed by this patch should only appear if both are
there, so we probably don't need to backport the fix. On the other
hand, it works by simplifying the code path and should not have any
negative effects.
[arnd@arndb.de: fix older gcc warnings]
(http://lkml.kernel.org/r/12243652.bxSxEgjgfk@wuerfel)
Link: https://lkml.org/lkml/headers/2016/4/12/1103
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70232
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70646
Fixes: e3bde9568d99 ("include/linux/unaligned: force inlining of byteswap operations")
Fixes: ef3fb2422ffe ("scsi: fc: use get/put_unaligned64 for wwn access")
Link: http://lkml.kernel.org/r/1780465.XdtPJpi8Tt@wuerfel
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Tested-by: Josh Poimboeuf <jpoimboe@redhat.com> # on gcc-5.3
Tested-by: Quinn Tran <quinn.tran@qlogic.com>
Cc: Martin Jambor <mjambor@suse.cz>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Himanshu Madhani <himanshu.madhani@qlogic.com>
Cc: Jan Hubicka <hubicka@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Recently, we allow to save the stacktrace whose hashed value is 0. It
causes the problem that stackdepot could return 0 even if in success.
User of stackdepot cannot distinguish whether it is success or not so we
need to solve this problem. In this patch, 1 bit are added to handle
and make valid handle none 0 by setting this bit. After that, valid
handle will not be 0 and 0 handle will represent failure correctly.
Fixes: 33334e25769c ("lib/stackdepot.c: allow the stack trace hash to be zero")
Link: http://lkml.kernel.org/r/1462252403-1106-1-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Assume memory47 is the last online block left in node1. This will hang:
# echo offline > /sys/devices/system/node/node1/memory47/state
After a couple of minutes, the following pops up in dmesg:
INFO: task bash:957 blocked for more than 120 seconds.
Not tainted 4.6.0-rc6+ #6
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
bash D ffff8800b7adbaf8 0 957 951 0x00000000
Call Trace:
schedule+0x35/0x80
schedule_timeout+0x1ac/0x270
wait_for_completion+0xe1/0x120
kthread_stop+0x4f/0x110
kcompactd_stop+0x26/0x40
__offline_pages.constprop.28+0x7e6/0x840
offline_pages+0x11/0x20
memory_block_action+0x73/0x1d0
memory_subsys_offline+0x47/0x60
device_offline+0x86/0xb0
store_mem_state+0xda/0xf0
dev_attr_store+0x18/0x30
sysfs_kf_write+0x37/0x40
kernfs_fop_write+0x11d/0x170
__vfs_write+0x37/0x120
vfs_write+0xa9/0x1a0
SyS_write+0x55/0xc0
entry_SYSCALL_64_fastpath+0x1a/0xa4
kcompactd is waiting for kcompactd_max_order > 0 when it's woken up to
actually exit. Check kthread_should_stop() to break out of the wait.
Fixes: 698b1b306 ("mm, compaction: introduce kcompactd").
Reported-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Tested-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Since the wildcard at the end of OF module aliases is gone, autoloading
of modules that don't match a device's last (most generic) compatible
value fails.
For example the CODA960 VPU on i.MX6Q has the SoC specific compatible
"fsl,imx6q-vpu" and the generic compatible "cnm,coda960". Since the
driver currently only works with knowledge about the SoC specific
integration, it doesn't list "cnm,cod960" in the module device table.
This results in the device compatible
"of:NvpuT<NULL>Cfsl,imx6q-vpuCcnm,coda960" not matching the module alias
"of:N*T*Cfsl,imx6q-vpu" anymore, whereas before commit 2f632369ab79
("modpost: don't add a trailing wildcard for OF module aliases") it
matched the module alias "of:N*T*Cfsl,imx6q-vpu*".
This patch adds two module aliases for each compatible, one without the
wildcard and one with "C*" appended.
$ modinfo coda | grep imx6q
alias: of:N*T*Cfsl,imx6q-vpuC*
alias: of:N*T*Cfsl,imx6q-vpu
Fixes: 2f632369ab79 ("modpost: don't add a trailing wildcard for OF module aliases")
Link: http://lkml.kernel.org/r/1462203339-15340-1-git-send-email-p.zabel@pengutronix.de
Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
Cc: Javier Martinez Canillas <javier@osg.samsung.com>
Cc: Brian Norris <computersforpeace@gmail.com>
Cc: Sjoerd Simons <sjoerd.simons@collabora.co.uk>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: <stable@vger.kernel.org> [4.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
If /proc/<PID>/environ gets read before the envp[] array is fully set up
in create_{aout,elf,elf_fdpic,flat}_tables(), we might end up trying to
read more bytes than are actually written, as env_start will already be
set but env_end will still be zero, making the range calculation
underflow, allowing to read beyond the end of what has been written.
Fix this as it is done for /proc/<PID>/cmdline by testing env_end for
zero. It is, apparently, intentionally set last in create_*_tables().
This bug was found by the PaX size_overflow plugin that detected the
arithmetic underflow of 'this_len = env_end - (env_start + src)' when
env_end is still zero.
The expected consequence is that userland trying to access
/proc/<PID>/environ of a not yet fully set up process may get
inconsistent data as we're in the middle of copying in the environment
variables.
Fixes: https://forums.grsecurity.net/viewtopic.php?f=3&t=4363
Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=116461
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Emese Revfy <re.emese@gmail.com>
Cc: Pax Team <pageexec@freemail.hu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Jarod Wilson <jarod@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Instead of using "zswap" as the name for all zpools created, add an
atomic counter and use "zswap%x" with the counter number for each zpool
created, to provide a unique name for each new zpool.
As zsmalloc, one of the zpool implementations, requires/expects a unique
name for each pool created, zswap should provide a unique name. The
zsmalloc pool creation does not fail if a new pool with a conflicting
name is created, unless CONFIG_ZSMALLOC_STAT is enabled; in that case,
zsmalloc pool creation fails with -ENOMEM. Then zswap will be unable to
change its compressor parameter if its zpool is zsmalloc; it also will
be unable to change its zpool parameter back to zsmalloc, if it has any
existing old zpool using zsmalloc with page(s) in it. Attempts to
change the parameters will result in failure to create the zpool. This
changes zswap to provide a unique name for each zpool creation.
Fixes: f1c54846ee45 ("zswap: dynamic pool creation")
Signed-off-by: Dan Streetman <ddstreet@ieee.org>
Reported-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Dan Streetman <dan.streetman@canonical.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
After the THP refcounting change, obtaining a compound pages from
get_user_pages() no longer allows us to assume the entire compound page
is immediately mappable from a secondary MMU.
A secondary MMU doesn't want to call get_user_pages() more than once for
each compound page, in order to know if it can map the whole compound
page. So a secondary MMU needs to know from a single get_user_pages()
invocation when it can map immediately the entire compound page to avoid
a flood of unnecessary secondary MMU faults and spurious
atomic_inc()/atomic_dec() (pages don't have to be pinned by MMU notifier
users).
Ideally instead of the page->_mapcount < 1 check, get_user_pages()
should return the granularity of the "page" mapping in the "mm" passed
to get_user_pages(). However it's non trivial change to pass the "pmd"
status belonging to the "mm" walked by get_user_pages up the stack (up
to the caller of get_user_pages). So the fix just checks if there is
not a single pte mapping on the page returned by get_user_pages, and in
turn if the caller can assume that the whole compound page is mapped in
the current "mm" (in a pmd_trans_huge()). In such case the entire
compound page is safe to map into the secondary MMU without additional
get_user_pages() calls on the surrounding tail/head pages. In addition
of being faster, not having to run other get_user_pages() calls also
reduces the memory footprint of the secondary MMU fault in case the pmd
split happened as result of memory pressure.
Without this fix after a MADV_DONTNEED (like invoked by QEMU during
postcopy live migration or balloning) or after generic swapping (with a
failure in split_huge_page() that would only result in pmd splitting and
not a physical page split), KVM would map the whole compound page into
the shadow pagetables, despite regular faults or userfaults (like
UFFDIO_COPY) may map regular pages into the primary MMU as result of the
pte faults, leading to the guest mode and userland mode going out of
sync and not working on the same memory at all times.
Any other secondary MMU notifier manager (KVM is just one of the many
MMU notifier users) will need the same information if it doesn't want to
run a flood of get_user_pages_fast and it can support multiple
granularity in the secondary MMU mappings, so I think it is justified to
be exposed not just to KVM.
The other option would be to move transparent_hugepage_adjust to
mm/huge_memory.c but that currently has all kind of KVM data structures
in it, so it's definitely not a cut-and-paste work, so I couldn't do a
fix as cleaner as this one for 4.6.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: "Li, Liang Z" <liang.z.li@intel.com>
Cc: Amit Shah <amit.shah@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Signed-off-by: Eric Engestrom <eric.engestrom@imgtec.com>
Cc: Rajendra Nayak <rnayak@codeaurora.org>
Cc: Afzal Mohammed <afzal.mohd.ma@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
/proc/sys/vm/stat_refresh warns nr_isolated_anon and nr_isolated_file go
increasingly negative under compaction: which would add delay when
should be none, or no delay when should delay. The bug in compaction
was due to a recent mmotm patch, but much older instance of the bug was
also noticed in isolate_migratepages_range() which is used for CMA and
gigantic hugepage allocations.
The bug is caused by putback_movable_pages() in an error path
decrementing the isolated counters without them being previously
incremented by acct_isolated(). Fix isolate_migratepages_range() by
removing the error-path putback, thus reaching acct_isolated() with
migratepages still isolated, and leaving putback to caller like most
other places do.
Fixes: edc2ca612496 ("mm, compaction: move pageblock checks up from isolate_migratepages_range()")
[vbabka@suse.cz: expanded the changelog]
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Khugepaged attempts to raise min_free_kbytes if its set too low.
However, on boot khugepaged sets min_free_kbytes first from
subsys_initcall(), and then the mm 'core' over-rides min_free_kbytes
after from init_per_zone_wmark_min(), via a module_init() call.
Khugepaged used to use a late_initcall() to set min_free_kbytes (such
that it occurred after the core initialization), however this was
removed when the initialization of min_free_kbytes was integrated into
the starting of the khugepaged thread.
The fix here is simply to invoke the core initialization using a
core_initcall() instead of module_init(), such that the previous
initialization ordering is restored. I didn't restore the
late_initcall() since start_stop_khugepaged() already sets
min_free_kbytes via set_recommended_min_free_kbytes().
This was noticed when we had a number of page allocation failures when
moving a workload to a kernel with this new initialization ordering. On
an 8GB system this restores min_free_kbytes back to 67584 from 11365
when CONFIG_TRANSPARENT_HUGEPAGE=y is set and either
CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y or
CONFIG_TRANSPARENT_HUGEPAGE_MADVISE=y.
Fixes: 79553da293d3 ("thp: cleanup khugepaged startup")
Signed-off-by: Jason Baron <jbaron@akamai.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|