readahead: make sure sync readahead reads needed page

Patch series "mm: Fix various readahead quirks". When we were internally testing performance of recent kernels, we have noticed quite variable performance of readahead arising from various quirks in readahead code. So I went on a cleaning spree there. This is a batch of patches resulting out of that. A quick testing in my test VM with the following fio job file: [global] direct=0 ioengine=sync invalidate=1 blocksize=4k size=10g readwrite=read [reader] numjobs=1 shows that this patch series improves the throughput from variable one in 310-340 MB/s range to rather stable one at 350 MB/s. As a side effect these cleanups also address the issue noticed by Bruz Zhang [1]. [1] https://lore.kernel.org/all/20240618114941.5935-1-zhangpengpeng0808@gmail.com/ Zhang Peng reported: : I test this batch of patch with fio, it indeed has a huge sppedup : in sequential read when block size is 4KiB. The result as follow, : for async read, iodepth is set to 128, and other settings : are self-evident. : : casename upstream withFix speedup : ---------------- -------- -------- ------- : randread-4k-sync 48991 47 : seqread-4k-sync 1162758 14229 : seqread-1024k-sync 1460208 1452522 : randread-4k-libaio 47467 4730 : randread-4k-posixaio 49190 49512 : seqread-4k-libaio 1085932 1234635 : seqread-1024k-libaio 1423341 1402214 -1 : seqread-4k-posixaio 1165084 1369613 1 : seqread-1024k-posixaio 1435422 1408808 -1.8 This patch (of 10): page_cache_sync_ra() is called when a folio we want to read is not in the page cache. It is expected that it creates the folio (and perhaps the following folios as well) and submits reads for them unless some error happens. However if index == ra->start + ra->size, ondemand_readahead() will treat the call as another async readahead hit. Thus ra->start will be advanced and we create pages and queue reads from ra->start + ra->size further. Consequentially the page at 'index' is not created and filemap_get_pages() has to always go through filemap_create_folio() path. This behavior has particularly unfortunate consequences when we have two IO threads sequentially reading from a shared file (as is the case when NFS serves sequential reads). In that case what can happen is: suppose ra->size == ra->async_size == 128, ra->start = 512 T1 T2 reads 128 pages at index 512 - hits async readahead mark filemap_readahead() ondemand_readahead() if (index == expected ...) ra->start = 512 + 128 = 640 ra->size = 128 ra->async_size = 128 page_cache_ra_order() blocks in ra_alloc_folio() reads 128 pages at index 640 - no page found page_cache_sync_readahead() ondemand_readahead() if (index == expected ...) ra->start = 640 + 128 = 768 ra->size = 128 ra->async_size = 128 page_cache_ra_order() submits reads from 768 - still no page found at index 640 filemap_create_folio() - goes on to index 641 page_cache_sync_readahead() ondemand_readahead() - founds ra is confused, trims is to small size finds pages were already inserted And as a result read performance suffers. Fix the problem by triggering async readahead case in ondemand_readahead() only if we are calling the function because we hit the readahead marker. In any other case we need to read the folio at 'index' and thus we cannot really use the current ra state. Note that the above situation could be viewed as a special case of file->f_ra state corruption. In fact two thread reading using the shared file can also seemingly corrupt file->f_ra in interesting ways due to concurrent access. I never saw that in practice and the fix is going to be much more complex so for now at least fix this practical problem while we ponder about the theoretically correct solution. Link: https://lkml.kernel.org/r/20240625100859.15507-1-jack@suse.cz Link: https://lkml.kernel.org/r/20240625101909.12234-1-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Tested-by: Zhang Peng <zhangpengpeng0808@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
author: Jan Kara <jack@suse.cz> 2024-06-25 12:18:51 +0200
committer: Andrew Morton <akpm@linux-foundation.org> 2024-07-04 04:30:26 +0200
commit: 8051b82a0be05751e41be1dfa4201c131e589450 (patch)
tree: 8fe40e135d08e84ec4e5c732339355f640fba5bf
parent: mm/migrate: move NUMA hinting fault folio isolation + checks under PTL (diff)
download: linux-8051b82a0be05751e41be1dfa4201c131e589450.tar.xz
linux-8051b82a0be05751e41be1dfa4201c131e589450.zip
1 files changed, 1 insertions, 1 deletions
diff --git a/mm/readahead.c b/mm/readahead.c
index c1b23989d9ca..af0fbd302a38 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -580,7 +580,7 @@ static void ondemand_readahead(struct readahead_control *ractl,
 	 */
 	expected = round_down(ra->start + ra->size - ra->async_size,
 			1UL << order);
-	if (index == expected || index == (ra->start + ra->size)) {
+	if (folio && index == expected) {
 		ra->start += ra->size;
 		ra->size = get_next_ra_size(ra, max_pages);
 		ra->async_size = ra->size;
author	Jan Kara <jack@suse.cz>	2024-06-25 12:18:51 +0200
committer	Andrew Morton <akpm@linux-foundation.org>	2024-07-04 04:30:26 +0200
commit	8051b82a0be05751e41be1dfa4201c131e589450 (patch)
tree	8fe40e135d08e84ec4e5c732339355f640fba5bf
parent	mm/migrate: move NUMA hinting fault folio isolation + checks under PTL (diff)
download	linux-8051b82a0be05751e41be1dfa4201c131e589450.tar.xz linux-8051b82a0be05751e41be1dfa4201c131e589450.zip