author | Bob Peterson <rpeterso@redhat.com> | 2019-08-30 19:31:02 +0200 |
---|---|---|
committer | Andreas Gruenbacher <agruenba@redhat.com> | 2019-09-04 20:22:17 +0200 |
commit | ad26967b9afa7faee22c3b79370cb5d9ab553493 (patch) | |
tree | 5062d7135c924b2fade3f01828750e4196382838 /fs/gfs2/glock.c | |
parent | gfs2: create function gfs2_glock_update_hold_time (diff) | |
gfs2: Use async glocks for rename
Because s_vfs_rename_mutex is not cluster-wide, multiple nodes can
reverse the roles of which directories are "old" and which are "new" for
the purposes of rename. This can cause deadlocks where two nodes end up
waiting for each other.
There can be several layers of directory dependencies across many nodes.
This patch fixes the problem by acquiring all gfs2_rename's inode glocks
asynchronously and waiting for all glocks to be acquired. That way all
inodes are locked regardless of the order.
The timeout value for multiple asynchronous glocks is calculated to be
the total of the individual wait times for each glock times two.
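
As a rough userspace illustration of the timeout rule above (not the kernel code itself), the sum can be sketched as follows. The `demo_glock` type and `demo_total_timeout` name are hypothetical stand-ins; only the "twice each hold time, summed" rule comes from the commit message.

```c
#include <stddef.h>

/* Hypothetical stand-in for a glock's minimum hold time (in jiffies). */
struct demo_glock {
	unsigned long hold_time;
};

/* Sum twice each glock's hold time to get the overall wait budget. */
static unsigned long demo_total_timeout(const struct demo_glock *gls, size_t n)
{
	unsigned long timeout = 0;
	size_t i;

	for (i = 0; i < n; i++)
		timeout += gls[i].hold_time << 1;	/* hold time * 2 */
	return timeout;
}
```

With hold times of 10, 20 and 5 jiffies, this budget would come to 70 jiffies.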
Since gfs2_exchange is very similar to gfs2_rename, both functions are
patched in the same way.
A new async glock wait queue, sd_async_glock_wait, keeps a list of
waiters for these events. If gfs2's holder_wake function detects an
async holder, it wakes up any waiters for the event. The waiter only
tests whether any of its requests are still pending.
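
The waiter's condition described above can be modeled in plain C as a scan over per-holder flag words. This is a simplified sketch: `DEMO_WAIT`, `demo_holder` and `demo_glocks_pending` are hypothetical names, and the real code uses the kernel's `test_bit()` on `gh_iflags` instead of a plain mask test.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag: the request has not yet been answered. */
#define DEMO_WAIT (1u << 0)

struct demo_holder {
	unsigned int iflags;
};

/* Return true if at least one request is still waiting for a reply. */
static bool demo_glocks_pending(const struct demo_holder *ghs, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (ghs[i].iflags & DEMO_WAIT)
			return true;
	return false;
}
```

The waiter sleeps until this predicate becomes false or the timeout expires, so a wake-up while other requests are still pending simply puts it back to sleep.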
Since the glocks are sent to dlm asynchronously, the wait function needs
to check to see which glocks, if any, were granted.
If a glock is granted by dlm (and therefore held), its minimum hold time
is checked and adjusted as necessary, as other glock grants do.
If the event times out, all glocks held thus far must be dequeued to
resolve any existing deadlocks. Then, if there are any outstanding
locking requests, we need to loop around and wait for dlm to respond to
those requests too. After we release all requests, we return -ESTALE to
the caller (vfs rename) which loops around and retries the request.
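
The timeout handling described above can be approximated in userspace as one pass over the holders, repeated until no request is pending. This is only a model under stated assumptions: `demo_holder` and `demo_async_pass` are hypothetical, the boolean fields stand in for the kernel's holder flags, and the caller is assumed to set `*ret` to `-ESTALE` once the wait has timed out.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical holder state; not the kernel's struct gfs2_holder. */
struct demo_holder {
	bool queued;	/* enqueued and not yet dequeued */
	bool waiting;	/* no reply from the lock manager yet */
	bool held;	/* granted and currently held */
	int error;	/* error from the lock manager, 0 if none */
};

/*
 * One pass after a wait. *ret carries the result across passes.
 * Returns true if some request is still pending, so the caller
 * must wait again before returning.
 */
static bool demo_async_pass(struct demo_holder *ghs, size_t n, int *ret)
{
	bool keep_waiting = false;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!ghs[i].queued)
			continue;		/* already released */
		if (ghs[i].waiting) {
			keep_waiting = true;	/* reply still outstanding */
			continue;
		}
		if (ghs[i].held && *ret == -ESTALE) {
			ghs[i].held = false;	/* release after timeout */
			ghs[i].queued = false;
		}
		if (!*ret)
			*ret = ghs[i].error;	/* keep the first error */
	}
	return keep_waiting;
}
```

A lock granted late, after the timeout, is released on the next pass, which matches the requirement that every lock be released before -ESTALE is returned.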
    Node1                           Node2
    ---------                       ---------
 1. Enqueue A                       Enqueue B
 2. Enqueue B                       Enqueue A
 3. A granted
 6.                                 B granted
 7. Wait for B
 8.                                 Wait for A
 9.                                 A times out (since Node 1 holds A)
10.                                 Dequeue B (since it was granted)
11.                                 Wait for all requests from DLM
12. B Granted (since Node2 released it in step 10)
13. Rename
14. Dequeue A
15.                                 DLM Grants A
16.                                 Dequeue A (due to the timeout and since we
                                    no longer have B held for our task).
17. Dequeue B
18.                                 Return -ESTALE to vfs
19.                                 VFS retries the operation, goto step 1.
This release-all-locks / acquire-all-locks may slow rename / exchange
down as both nodes struggle in the same way and do the same thing.
However, this will only happen when there is contention for the same
inodes, which ought to be rare.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Diffstat (limited to 'fs/gfs2/glock.c')
-rw-r--r-- | fs/gfs2/glock.c | 94 |
1 files changed, 92 insertions, 2 deletions
diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 661350989e98..0290a22ebccf 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -305,6 +305,11 @@ static void gfs2_holder_wake(struct gfs2_holder *gh)
 	clear_bit(HIF_WAIT, &gh->gh_iflags);
 	smp_mb__after_atomic();
 	wake_up_bit(&gh->gh_iflags, HIF_WAIT);
+	if (gh->gh_flags & GL_ASYNC) {
+		struct gfs2_sbd *sdp = gh->gh_gl->gl_name.ln_sbd;
+
+		wake_up(&sdp->sd_async_glock_wait);
+	}
 }
 
 /**
@@ -959,6 +964,91 @@ int gfs2_glock_wait(struct gfs2_holder *gh)
 	return gh->gh_error;
 }
 
+static int glocks_pending(unsigned int num_gh, struct gfs2_holder *ghs)
+{
+	int i;
+
+	for (i = 0; i < num_gh; i++)
+		if (test_bit(HIF_WAIT, &ghs[i].gh_iflags))
+			return 1;
+	return 0;
+}
+
+/**
+ * gfs2_glock_async_wait - wait on multiple asynchronous glock acquisitions
+ * @num_gh: the number of holders in the array
+ * @ghs: the glock holder array
+ *
+ * Returns: 0 on success, meaning all glocks have been granted and are held.
+ *          -ESTALE if the request timed out, meaning all glocks were released,
+ *          and the caller should retry the operation.
+ */
+
+int gfs2_glock_async_wait(unsigned int num_gh, struct gfs2_holder *ghs)
+{
+	struct gfs2_sbd *sdp = ghs[0].gh_gl->gl_name.ln_sbd;
+	int i, ret = 0, timeout = 0;
+	unsigned long start_time = jiffies;
+	bool keep_waiting;
+
+	might_sleep();
+	/*
+	 * Total up the (minimum hold time * 2) of all glocks and use that to
+	 * determine the max amount of time we should wait.
+	 */
+	for (i = 0; i < num_gh; i++)
+		timeout += ghs[i].gh_gl->gl_hold_time << 1;
+
+wait_for_dlm:
+	if (!wait_event_timeout(sdp->sd_async_glock_wait,
+				!glocks_pending(num_gh, ghs), timeout))
+		ret = -ESTALE; /* request timed out. */
+
+	/*
+	 * If dlm granted all our requests, we need to adjust the glock
+	 * minimum hold time values according to how long we waited.
+	 *
+	 * If our request timed out, we need to repeatedly release any held
+	 * glocks we acquired thus far to allow dlm to acquire the remaining
+	 * glocks without deadlocking.  We cannot currently cancel outstanding
+	 * glock acquisitions.
+	 *
+	 * The HIF_WAIT bit tells us which requests still need a response from
+	 * dlm.
+	 *
+	 * If dlm sent us any errors, we return the first error we find.
+	 */
+	keep_waiting = false;
+	for (i = 0; i < num_gh; i++) {
+		/* Skip holders we have already dequeued below. */
+		if (!gfs2_holder_queued(&ghs[i]))
+			continue;
+		/* Skip holders with a pending DLM response. */
+		if (test_bit(HIF_WAIT, &ghs[i].gh_iflags)) {
+			keep_waiting = true;
+			continue;
+		}
+
+		if (test_bit(HIF_HOLDER, &ghs[i].gh_iflags)) {
+			if (ret == -ESTALE)
+				gfs2_glock_dq(&ghs[i]);
+			else
+				gfs2_glock_update_hold_time(ghs[i].gh_gl,
+							    start_time);
+		}
+		if (!ret)
+			ret = ghs[i].gh_error;
+	}
+
+	if (keep_waiting)
+		goto wait_for_dlm;
+
+	/*
+	 * At this point, we've either acquired all locks or released them all.
+	 */
+	return ret;
+}
+
 /**
  * handle_callback - process a demote request
  * @gl: the glock
@@ -1025,9 +1115,9 @@ __acquires(&gl->gl_lockref.lock)
 	struct gfs2_holder *gh2;
 	int try_futile = 0;
 
-	BUG_ON(gh->gh_owner_pid == NULL);
+	GLOCK_BUG_ON(gl, gh->gh_owner_pid == NULL);
 	if (test_and_set_bit(HIF_WAIT, &gh->gh_iflags))
-		BUG();
+		GLOCK_BUG_ON(gl, true);
 	if (gh->gh_flags & (LM_FLAG_TRY | LM_FLAG_TRY_1CB)) {
 		if (test_bit(GLF_LOCK, &gl->gl_flags))