btrfs: save i_size to avoid double evaluation of i_size_read in compress_file_range

We hit a regression while rolling out 5.2 internally where we were hitting the following panic kernel BUG at mm/page-writeback.c:2659! RIP: 0010:clear_page_dirty_for_io+0xe6/0x1f0 Call Trace: __process_pages_contig+0x25a/0x350 ? extent_clear_unlock_delalloc+0x43/0x70 submit_compressed_extents+0x359/0x4d0 normal_work_helper+0x15a/0x330 process_one_work+0x1f5/0x3f0 worker_thread+0x2d/0x3d0 ? rescuer_thread+0x340/0x340 kthread+0x111/0x130 ? kthread_create_on_node+0x60/0x60 ret_from_fork+0x1f/0x30 This is happening because the page is not locked when doing clear_page_dirty_for_io. Looking at the core dump it was because our async_extent had a ram_size of 24576 but our async_chunk range only spanned 20480, so we had a whole extra page in our ram_size for our async_extent. This happened because we try not to compress pages outside of our i_size, however a cleanup patch changed us to do actual_end = min_t(u64, i_size_read(inode), end + 1); which is problematic because i_size_read() can evaluate to different values in between checking and assigning. So either an expanding truncate or a fallocate could increase our i_size while we're doing writeout and actual_end would end up being past the range we have locked. I confirmed this was what was happening by installing a debug kernel that had actual_end = min_t(u64, i_size_read(inode), end + 1); if (actual_end > end + 1) { printk(KERN_ERR "KABOOM\n"); actual_end = end + 1; } and installing it onto 500 boxes of the tier that had been seeing the problem regularly. Last night I got my debug message and no panic, confirming what I expected. [ dsterba: the assembly confirms a tiny race window: mov 0x20(%rsp),%rax cmp %rax,0x48(%r15) # read movl $0x0,0x18(%rsp) mov %rax,%r12 mov %r14,%rax cmovbe 0x48(%r15),%r12 # eval Where r15 is inode and 0x48 is offset of i_size. The original fix was to revert 62b37622718c that would do an intermediate assignment and this would also avoid the doulble evaluation but is not future-proof, should the compiler merge the stores and call i_size_read anyway. There's a patch adding READ_ONCE to i_size_read but that's not being applied at the moment and we need to fix the bug. Instead, emulate READ_ONCE by two barrier()s that's what effectively happens. The assembly confirms single evaluation: mov 0x48(%rbp),%rax # read once mov 0x20(%rsp),%rcx mov $0x20,%edx cmp %rax,%rcx cmovbe %rcx,%rax mov %rax,(%rsp) mov %rax,%rcx mov %r14,%rax Where 0x48(%rbp) is inode->i_size stored to %eax. ] Fixes: 62b37622718c ("btrfs: Remove isize local variable in compress_file_range") CC: stable@vger.kernel.org # v5.1+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ changelog updated ] Signed-off-by: David Sterba <dsterba@suse.com>
author: Josef Bacik <josef@toxicpanda.com> 2019-10-11 15:03:54 +0200
committer: David Sterba <dsterba@suse.com> 2019-11-04 21:41:49 +0100
commit: d98da49977f67394db492f06c00b1fb1cc090c05 (patch)
tree: b51973e1ee9091531f1ef33d6876b31e857f6b84 /fs
parent: Btrfs: fix race leading to metadata space leak after task received signal (diff)
download: linux-d98da49977f67394db492f06c00b1fb1cc090c05.tar.xz
linux-d98da49977f67394db492f06c00b1fb1cc090c05.zip
1 files changed, 14 insertions, 1 deletions
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c3f386b7cc0b..c6dc4dd16cf7 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -474,6 +474,7 @@ static noinline int compress_file_range(struct async_chunk *async_chunk)
 	u64 start = async_chunk->start;
 	u64 end = async_chunk->end;
 	u64 actual_end;
+	u64 i_size;
 	int ret = 0;
 	struct page **pages = NULL;
 	unsigned long nr_pages;
@@ -488,7 +489,19 @@ static noinline int compress_file_range(struct async_chunk *async_chunk)
 	inode_should_defrag(BTRFS_I(inode), start, end, end - start + 1,
 			SZ_16K);
 
-	actual_end = min_t(u64, i_size_read(inode), end + 1);
+	/*
+	 * We need to save i_size before now because it could change in between
+	 * us evaluating the size and assigning it.  This is because we lock and
+	 * unlock the page in truncate and fallocate, and then modify the i_size
+	 * later on.
+	 *
+	 * The barriers are to emulate READ_ONCE, remove that once i_size_read
+	 * does that for us.
+	 */
+	barrier();
+	i_size = i_size_read(inode);
+	barrier();
+	actual_end = min_t(u64, i_size, end + 1);
 again:
 	will_compress = 0;
 	nr_pages = (end >> PAGE_SHIFT) - (start >> PAGE_SHIFT) + 1;
author	Josef Bacik <josef@toxicpanda.com>	2019-10-11 15:03:54 +0200
committer	David Sterba <dsterba@suse.com>	2019-11-04 21:41:49 +0100
commit	d98da49977f67394db492f06c00b1fb1cc090c05 (patch)
tree	b51973e1ee9091531f1ef33d6876b31e857f6b84 /fs
parent	Btrfs: fix race leading to metadata space leak after task received signal (diff)
download	linux-d98da49977f67394db492f06c00b1fb1cc090c05.tar.xz linux-d98da49977f67394db492f06c00b1fb1cc090c05.zip