summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'for-5.6/libata-2020-01-27' of git://git.kernel.dk/linux-blockLinus Torvalds2020-01-274-18/+65
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull libata updates from Jens Axboe: "As usual pretty quiet, mostly trivial fixes (or dead code removal), outside of various fixes for ahci_bcrm" * tag 'for-5.6/libata-2020-01-27' of git://git.kernel.dk/linux-block: ata/acard_ahci: remove unused variable n_elem ata: pata_macio: fix comparing pointer to 0 ata: ahci_brcm: BCM7216 reset is self de-asserting ata: ahci_brcm: Perform reset after obtaining resources ata: brcm: fix reset controller API usage ata: brcm: mark PM functions as __maybe_unused ata: ahci_brcm: Support BCM7216 reset controller name dt-bindings: ata: Document BCM7216 AHCI controller compatible ata: ahci_brcm: Add a shutdown callback ata: ahci_brcm: Manage reset line during suspend/resume
| * ata/acard_ahci: remove unused variable n_elemAlex Shi2020-01-221-3/+1
| | | | | | | | | | | | | | | | | | | | | | No one care the varible acard_ahci in func acard_ahci_qc_prep. better to remove it. Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-ide@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * ata: pata_macio: fix comparing pointer to 0Chen Zhou2020-01-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | Fixes coccicheck warning: ./drivers/ata/pata_macio.c:982:31-32: WARNING comparing pointer to 0, suggest !E Compare pointer-typed values to NULL rather than 0. Acked-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Chen Zhou <chenzhou10@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * ata: ahci_brcm: BCM7216 reset is self de-assertingFlorian Fainelli2020-01-181-4/+12
| | | | | | | | | | | | | | | | | | The BCM7216 reset controller line is self-deasserting, unlike other platforms, so make use of reset_control_reset() instead of reset_control_deassert(). Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * ata: ahci_brcm: Perform reset after obtaining resourcesFlorian Fainelli2020-01-181-6/+6
| | | | | | | | | | | | | | | | | | | | | | Resources such as clocks, PHYs, regulators are likely to get a probe deferral return code, which could lead to the AHCI controller being reset a few times until it gets successfully probed. Since this is typically the most time consuming operation, move it after the resources have been acquired. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * ata: brcm: fix reset controller API usageArnd Bergmann2020-01-171-9/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | While fixing another issue in this driver I noticed it uses IS_ERR_OR_NULL(), which is almost always a mistake. Change the driver to use the proper devm_reset_control_get_optional() interface instead and remove the checks except for the one that checks for a failure in that function. Fixes: 2b2c47d9e1fe ("ata: ahci_brcm: Allow optional reset controller to be used") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * ata: brcm: mark PM functions as __maybe_unusedArnd Bergmann2020-01-171-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The new shutdown callback causes a link failure: drivers/ata/ahci_brcm.c: In function 'brcm_ahci_shutdown': drivers/ata/ahci_brcm.c:552:8: error: implicit declaration of function 'brcm_ahci_suspend'; did you mean 'brcm_ahci_shutdown'? [-Werror=implicit-function-declaration] ret = brcm_ahci_suspend(&pdev->dev); ^~~~~~~~~~~~~~~~~ Remove the incorrect #ifdef and use __maybe_unused annotations instead to make this more robust. Fixes: 7de9b1688c1d ("ata: ahci_brcm: Add a shutdown callback") Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * ata: ahci_brcm: Support BCM7216 reset controller nameFlorian Fainelli2019-12-261-2/+10
| | | | | | | | | | | | | | | | | | | | | | BCM7216 uses a different reset controller name which is "rescal" instead of "ahci", match the compatible string to account for that minor difference, the reset is otherwise identical to how other generations of SATA controllers work. Reviewed-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * dt-bindings: ata: Document BCM7216 AHCI controller compatibleFlorian Fainelli2019-12-261-0/+7
| | | | | | | | | | | | | | | | | | | | | | The BCM7216 AHCI controller makes use of a specific reset controller line/name, document the compatible string and the optional reset properties. Reviewed-by: Rob Herring <robh@kernel.org> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * ata: ahci_brcm: Add a shutdown callbackFlorian Fainelli2019-12-261-0/+15
| | | | | | | | | | | | | | | | | | Make sure that we quiesce the controller and shut down the clocks in a shutdown callback. Reviewed-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * ata: ahci_brcm: Manage reset line during suspend/resumeFlorian Fainelli2019-12-261-2/+13
| | | | | | | | | | | | | | | | | | | | | | We were not managing the reset line during suspend/resume, but this needs to be done to ensure that the controller can exit low power modes correctly, especially with deep sleep suspend mode that may reset parts of the logic. Reviewed-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | Merge tag 'for-5.6/drivers-2020-01-27' of git://git.kernel.dk/linux-blockLinus Torvalds2020-01-2717-262/+571
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block driver updates from Jens Axboe: "Like the core side, not a lot of changes here, just two main items: - Series of patches (via Coly) with fixes for bcache (Coly, Christoph) - MD pull request from Song" * tag 'for-5.6/drivers-2020-01-27' of git://git.kernel.dk/linux-block: (31 commits) bcache: reap from tail of c->btree_cache in bch_mca_scan() bcache: reap c->btree_cache_freeable from the tail in bch_mca_scan() bcache: remove member accessed from struct btree bcache: print written and keys in trace_bcache_btree_write bcache: avoid unnecessary btree nodes flushing in btree_flush_write() bcache: add code comments for state->pool in __btree_sort() lib: crc64: include <linux/crc64.h> for 'crc64_be' bcache: use read_cache_page_gfp to read the superblock bcache: store a pointer to the on-disk sb in the cache and cached_dev structures bcache: return a pointer to the on-disk sb from read_super bcache: transfer the sb_page reference to register_{bdev,cache} bcache: fix use-after-free in register_bcache() bcache: properly initialize 'path' and 'err' in register_bcache() bcache: rework error unwinding in register_bcache bcache: use a separate data structure for the on-disk super block bcache: cached_dev_free needs to put the sb page md/raid1: introduce wait_for_serialization md/raid1: use bucket based mechanism for IO serialization md: introduce a new struct for IO serialization md: don't destroy serial_info_pool if serialize_policy is true ...
| * | bcache: reap from tail of c->btree_cache in bch_mca_scan()Coly Li2020-01-231-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When shrink btree node cache from c->btree_cache in bch_mca_scan(), no matter the selected node is reaped or not, it will be rotated from the head to the tail of c->btree_cache list. But in bcache journal code, when flushing the btree nodes with oldest journal entry, btree nodes are iterated and slected from the tail of c->btree_cache list in btree_flush_write(). The list_rotate_left() in bch_mca_scan() will make btree_flush_write() iterate more nodes in c->btree_list in reverse order. This patch just reaps the selected btree node cache, and not move it from the head to the tail of c->btree_cache list. Then bch_mca_scan() will not mess up c->btree_cache list to btree_flush_write(). Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | bcache: reap c->btree_cache_freeable from the tail in bch_mca_scan()Coly Li2020-01-231-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In order to skip the most recently freed btree node cahce, currently in bch_mca_scan() the first 3 caches in c->btree_cache_freeable list are skipped when shrinking bcache node caches in bch_mca_scan(). The related code in bch_mca_scan() is, 737 list_for_each_entry_safe(b, t, &c->btree_cache_freeable, list) { 738 if (nr <= 0) 739 goto out; 740 741 if (++i > 3 && 742 !mca_reap(b, 0, false)) { lines free cache memory 746 } 747 nr--; 748 } The problem is, if virtual memory code calls bch_mca_scan() and the calculated 'nr' is 1 or 2, then in the above loop, nothing will be shunk. In such case, if slub/slab manager calls bch_mca_scan() for many times with small scan number, it does not help to shrink cache memory and just wasts CPU cycles. This patch just selects btree node caches from tail of the c->btree_cache_freeable list, then the newly freed host cache can still be allocated by mca_alloc(), and at least 1 node can be shunk. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | bcache: remove member accessed from struct btreeColy Li2020-01-232-8/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The member 'accessed' of struct btree is used in bch_mca_scan() when shrinking btree node caches. The original idea is, if b->accessed is set, clean it and look at next btree node cache from c->btree_cache list, and only shrink the caches whose b->accessed is cleaned. Then only cold btree node cache will be shrunk. But when I/O pressure is high, it is very probably that b->accessed of a btree node cache will be set again in bch_btree_node_get() before bch_mca_scan() selects it again. Then there is no chance for bch_mca_scan() to shrink enough memory back to slub or slab system. This patch removes member accessed from struct btree, then once a btree node ache is selected, it will be immediately shunk. By this change, bch_mca_scan() may release btree node cahce more efficiently. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | bcache: print written and keys in trace_bcache_btree_writeGuoju Fang2020-01-231-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | It's useful to dump written block and keys on btree write, this patch add them into trace_bcache_btree_write. Signed-off-by: Guoju Fang <fangguoju@gmail.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | bcache: avoid unnecessary btree nodes flushing in btree_flush_write()Coly Li2020-01-231-5/+75
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | the commit 91be66e1318f ("bcache: performance improvement for btree_flush_write()") was an effort to flushing btree node with oldest btree node faster in following methods, - Only iterate dirty btree nodes in c->btree_cache, avoid scanning a lot of clean btree nodes. - Take c->btree_cache as a LRU-like list, aggressively flushing all dirty nodes from tail of c->btree_cache util the btree node with oldest journal entry is flushed. This is to reduce the time of holding c->bucket_lock. Guoju Fang and Shuang Li reported that they observe unexptected extra write I/Os on cache device after applying the above patch. Guoju Fang provideed more detailed diagnose information that the aggressive btree nodes flushing may cause 10x more btree nodes to flush in his workload. He points out when system memory is large enough to hold all btree nodes in memory, c->btree_cache is not a LRU-like list any more. Then the btree node with oldest journal entry is very probably not- close to the tail of c->btree_cache list. In such situation much more dirty btree nodes will be aggressively flushed before the target node is flushed. When slow SATA SSD is used as cache device, such over- aggressive flushing behavior will cause performance regression. After spending a lot of time on debug and diagnose, I find the real condition is more complicated, aggressive flushing dirty btree nodes from tail of c->btree_cache list is not a good solution. - When all btree nodes are cached in memory, c->btree_cache is not a LRU-like list, the btree nodes with oldest journal entry won't be close to the tail of the list. - There can be hundreds dirty btree nodes reference the oldest journal entry, before flushing all the nodes the oldest journal entry cannot be reclaimed. When the above two conditions mixed together, a simply flushing from tail of c->btree_cache list is really NOT a good idea. Fortunately there is still chance to make btree_flush_write() work better. Here is how this patch avoids unnecessary btree nodes flushing, - Only acquire c->journal.lock when getting oldest journal entry of fifo c->journal.pin. In rested locations check the journal entries locklessly, so their values can be changed on other cores in parallel. - In loop list_for_each_entry_safe_reverse(), checking latest front point of fifo c->journal.pin. If it is different from the original point which we get with locking c->journal.lock, it means the oldest journal entry is reclaim on other cores. At this moment, all selected dirty nodes recorded in array btree_nodes[] are all flushed and clean on other CPU cores, it is unncessary to iterate c->btree_cache any longer. Just quit the list_for_each_entry_safe_reverse() loop and the following for-loop will skip all the selected clean nodes. - Find a proper time to quit the list_for_each_entry_safe_reverse() loop. Check the refcount value of orignial fifo front point, if the value is larger than selected node number of btree_nodes[], it means more matching btree nodes should be scanned. Otherwise it means no more matching btee nodes in rest of c->btree_cache list, the loop can be quit. If the original oldest journal entry is reclaimed and fifo front point is updated, the refcount of original fifo front point will be 0, then the loop will be quit too. - Not hold c->bucket_lock too long time. c->bucket_lock is also required for space allocation for cached data, hold it for too long time will block regular I/O requests. When iterating list c->btree_cache, even there are a lot of maching btree nodes, in order to not holding c->bucket_lock for too long time, only BTREE_FLUSH_NR nodes are selected and to flush in following for-loop. With this patch, only btree nodes referencing oldest journal entry are flushed to cache device, no aggressive flushing for unnecessary btree node any more. And in order to avoid blocking regluar I/O requests, each time when btree_flush_write() called, at most only BTREE_FLUSH_NR btree nodes are selected to flush, even there are more maching btree nodes in list c->btree_cache. At last, one more thing to explain: Why it is safe to read front point of c->journal.pin without holding c->journal.lock inside the list_for_each_entry_safe_reverse() loop ? Here is my answer: When reading the front point of fifo c->journal.pin, we don't need to know the exact value of front point, we just want to check whether the value is different from the original front point (which is accurate value because we get it while c->jouranl.lock is held). For such purpose, it works as expected without holding c->journal.lock. Even the front point is changed on other CPU core and not updated to local core, and current iterating btree node has identical journal entry local as original fetched fifo front point, it is still safe. Because after holding mutex b->write_lock (with memory barrier) this btree node can be found as clean and skipped, the loop will quite latter when iterate on next node of list c->btree_cache. Fixes: 91be66e1318f ("bcache: performance improvement for btree_flush_write()") Reported-by: Guoju Fang <fangguoju@gmail.com> Reported-by: Shuang Li <psymon@bonuscloud.io> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | bcache: add code comments for state->pool in __btree_sort()Coly Li2020-01-231-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | To explain the pages allocated from mempool state->pool can be swapped in __btree_sort(), because state->pool is a page pool, which allocates pages by alloc_pages() indeed. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | lib: crc64: include <linux/crc64.h> for 'crc64_be'Ben Dooks (Codethink)2020-01-231-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The crc64_be() is declared in <linux/crc64.h> so include this where the symbol is defined to avoid the following warning: lib/crc64.c:43:12: warning: symbol 'crc64_be' was not declared. Should it be static? Signed-off-by: Ben Dooks (Codethink) <ben.dooks@codethink.co.uk> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | bcache: use read_cache_page_gfp to read the superblockChristoph Hellwig2020-01-232-9/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Avoid a pointless dependency on buffer heads in bcache by simply open coding reading a single page. Also add a SB_OFFSET define for the byte offset of the superblock instead of using magic numbers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | bcache: store a pointer to the on-disk sb in the cache and cached_dev structuresChristoph Hellwig2020-01-232-19/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This allows to properly build the superblock bio including the offset in the page using the normal bio helpers. This fixes writing the superblock for page sizes larger than 4k where the sb write bio would need an offset in the bio_vec. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | bcache: return a pointer to the on-disk sb from read_superChristoph Hellwig2020-01-231-11/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Returning the properly typed actual data structure insteaf of the containing struct page will save the callers some work going forward. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | bcache: transfer the sb_page reference to register_{bdev,cache}Christoph Hellwig2020-01-231-15/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | Avoid an extra reference count roundtrip by transferring the sb_page ownership to the lower level register helpers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | bcache: fix use-after-free in register_bcache()Coly Li2020-01-231-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The patch "bcache: rework error unwinding in register_bcache" introduces a use-after-free regression in register_bcache(). Here are current code, 2510 out_free_path: 2511 kfree(path); 2512 out_module_put: 2513 module_put(THIS_MODULE); 2514 out: 2515 pr_info("error %s: %s", path, err); 2516 return ret; If some error happens and the above code path is executed, at line 2511 path is released, but referenced at line 2515. Then KASAN reports a use- after-free error message. This patch changes line 2515 in the following way to fix the problem, 2515 pr_info("error %s: %s", path?path:"", err); Signed-off-by: Coly Li <colyli@suse.de> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | bcache: properly initialize 'path' and 'err' in register_bcache()Coly Li2020-01-231-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch "bcache: rework error unwinding in register_bcache" from Christoph Hellwig changes the local variables 'path' and 'err' in undefined initial state. If the code in register_bcache() jumps to label 'out:' or 'out_module_put:' by goto, these two variables might be reference with undefined value by the following line, out_module_put: module_put(THIS_MODULE); out: pr_info("error %s: %s", path, err); return ret; Therefore this patch initializes these two local variables properly in register_bcache() to avoid such issue. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | bcache: rework error unwinding in register_bcacheChristoph Hellwig2020-01-231-30/+45
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Split the successful and error return path, and use one goto label for each resource to unwind. This also fixes some small errors like leaking the module reference count in the reboot case (which seems entirely harmless) or printing the wrong warning messages for early failures. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | bcache: use a separate data structure for the on-disk super blockChristoph Hellwig2020-01-232-3/+54
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Split out an on-disk version struct cache_sb with the proper endianness annotations. This fixes a fair chunk of sparse warnings, but there are some left due to the way the checksum is defined. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | bcache: cached_dev_free needs to put the sb pageLiang Chen2020-01-231-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Same as cache device, the buffer page needs to be put while freeing cached_dev. Otherwise a page would be leaked every time a cached_dev is stopped. Signed-off-by: Liang Chen <liangchen.linux@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | Merge branch 'md-next' of ↵Jens Axboe2020-01-148-175/+353
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-5.6/drivers Pull MD changes from Song. * 'md-next' of git://git.kernel.org/pub/scm/linux/kernel/git/song/md: md/raid1: introduce wait_for_serialization md/raid1: use bucket based mechanism for IO serialization md: introduce a new struct for IO serialization md: don't destroy serial_info_pool if serialize_policy is true raid1: serialize the overlap write md: reorgnize mddev_create/destroy_serial_pool md: add serialize_policy sysfs node for raid1 md: prepare for enable raid1 io serialization md: fix a typo s/creat/create md: rename wb stuffs raid5: remove worker_cnt_per_group argument from alloc_thread_groups md/raid6: fix algorithm choice under larger PAGE_SIZE raid6/test: fix a compilation warning raid6/test: fix a compilation error md-bitmap: small cleanups
| | * | md/raid1: introduce wait_for_serializationGuoqing Jiang2020-01-131-19/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously, we call check_and_add_serial when serialization is enabled for write IO, but it could allocate and free memory back and forth. Now, let's just get an element from memory pool with the new function, then insert node to rb tree if no collision happens. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
| | * | md/raid1: use bucket based mechanism for IO serializationGuoqing Jiang2020-01-132-8/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since raid1 had already used bucket based mechanism to reduce the conflict between write IO and resync IO, it is possible to speed up performance for io serialization with refer to the same mechanism. To align with the barrier bucket mechanism, we created arrays (with the same number of BARRIER_BUCKETS_NR) for spinlock, rb tree and waitqueue. Then we can reduce lock competition with multiple spinlocks, boost search performance with multiple rb trees and also reduce thundering herd problem with multiple waitqueues. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
| | * | md: introduce a new struct for IO serializationGuoqing Jiang2020-01-134-63/+116
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Obviously, IO serialization could cause the degradation of performance a lot. In order to reduce the degradation, so a rb interval tree is added in raid1 to speed up the check of collision. So, a rb root is needed in md_rdev, then abstract all the serialize related members to a new struct (serial_in_rdev), embed it into md_rdev. Of course, we need to free the struct if it is not needed anymore, so rdev/rdevs_uninit_serial are added accordingly. And they should be called when destroty memory pool or can't alloc memory. And we need to consider to call mddev_destroy_serial_pool in case serialize_policy/write-behind is disabled, bitmap is destroyed or in __md_stop_writes. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
| | * | md: don't destroy serial_info_pool if serialize_policy is trueGuoqing Jiang2020-01-131-4/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The serial_info_pool is needed if array sets serialize_policy to true, so don't destroy it. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
| | * | raid1: serialize the overlap writeGuoqing Jiang2020-01-131-14/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Before dispatch write bio, raid1 array which enables serialize_policy need to check if overlap exists between this bio and previous on-flying bios. If there is overlap, then it has to wait until the collision is disappeared. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
| | * | md: reorgnize mddev_create/destroy_serial_poolGuoqing Jiang2020-01-131-29/+42
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | So far, IO serialization is used for two scenarios: 1. raid1 which enables write-behind mode, and there is rdev in the array which is multi-queue device and flaged with writemostly. 2. IO serialization is enabled or disabled by change serialize_policy. So introduce rdev_need_serial to check the first scenario. And for 1, IO serialization is enabled automatically while 2 is controlled manually. And it is possible that both scenarios are true, so for create serial pool, rdev/rdevs_init_serial should be separate from check if the pool existed or not. Then for destroy pool, we need to check if the pool is needed by other rdevs due to the first scenario. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
| | * | md: add serialize_policy sysfs node for raid1Guoqing Jiang2020-01-132-0/+53
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With the new sysfs node, we can use it to control if raid1 array wants io serialization or not. So mddev_create_serial_pool and mddev_destroy_serial_pool are called in serialize_policy_store to enable or disable the serialization. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
| | * | md: prepare for enable raid1 io serializationGuoqing Jiang2020-01-132-21/+46
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 1. The related resources (spin_lock, list and waitqueue) are needed for address raid1 reorder overlap issue too, in this case, rdev is set to NULL for mddev_create/destroy_serial_pool which implies all rdevs need to handle these resources. And also add "is_suspend" to mddev_destroy_serial_pool since it will be called under suspended situation, which also makes both create and destroy pool have same arguments. 2. Introduce rdevs_init_serial which is called if raid1 io serialization is enabled since all rdevs need to init related stuffs. 3. rdev_init_serial and clear_bit(CollisionCheck, &rdev->flags) should be called between suspend and resume. No need to export mddev_create_serial_pool since it is only called in md-mod module. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
| | * | md: fix a typo s/creat/createGuoqing Jiang2020-01-131-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It actually means create here, so fix the typo. Reported-by: Song Liu <liu.song.a23@gmail.com> Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
| | * | md: rename wb stuffsGuoqing Jiang2020-01-134-77/+80
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously, wb_info_pool and wb_list stuffs are introduced to address potential data inconsistence issue for write behind device. Now rename them to serial related name, since the same mechanism will be used to address reorder overlap write issue for raid1. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
| | * | raid5: remove worker_cnt_per_group argument from alloc_thread_groupsGuoqing Jiang2020-01-131-14/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We can use "cnt" directly to update conf->worker_cnt_per_group if alloc_thread_groups returns 0. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
| | * | md/raid6: fix algorithm choice under larger PAGE_SIZEZhengyuan Liu2020-01-132-23/+44
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are several algorithms available for raid6 to generate xor and syndrome parity, including basic int1, int2 ... int32 and SIMD optimized implementation like sse and neon. To test and choose the best algorithms at the initial stage, we need provide enough disk data to feed the algorithms. However, the disk number we provided depends on page size and gfmul table, seeing bellow: const int disks = (65536/PAGE_SIZE) + 2; So when come to 64K PAGE_SIZE, there is only one data disk plus 2 parity disk, as a result the chosed algorithm is not reliable. For example, on my arm64 machine with 64K page enabled, it will choose intx32 as the best one, although the NEON implementation is better. This patch tries to fix the problem by defining a constant raid6 disk number to supporting arbitrary page size. Suggested-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Song Liu <songliubraving@fb.com>
| | * | raid6/test: fix a compilation warningZhengyuan Liu2020-01-132-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The compilation warning is redefination showed as following: In file included from tables.c:2: ../../../include/linux/export.h:180: warning: "EXPORT_SYMBOL" redefined #define EXPORT_SYMBOL(sym) __EXPORT_SYMBOL(sym, "") In file included from tables.c:1: ../../../include/linux/raid/pq.h:61: note: this is the location of the previous definition #define EXPORT_SYMBOL(sym) Fixes: 69a94abb82ee ("export.h, genksyms: do not make genksyms calculate CRC of trimmed symbols") Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Song Liu <songliubraving@fb.com>
| | * | raid6/test: fix a compilation errorZhengyuan Liu2020-01-131-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The compilation error is redeclaration showed as following: In file included from ../../../include/linux/limits.h:6, from /usr/include/x86_64-linux-gnu/bits/local_lim.h:38, from /usr/include/x86_64-linux-gnu/bits/posix1_lim.h:161, from /usr/include/limits.h:183, from /usr/lib/gcc/x86_64-linux-gnu/8/include-fixed/limits.h:194, from /usr/lib/gcc/x86_64-linux-gnu/8/include-fixed/syslimits.h:7, from /usr/lib/gcc/x86_64-linux-gnu/8/include-fixed/limits.h:34, from ../../../include/linux/raid/pq.h:30, from algos.c:14: ../../../include/linux/types.h:114:15: error: conflicting types for ‘int64_t’ typedef s64 int64_t; ^~~~~~~ In file included from /usr/include/stdint.h:34, from /usr/lib/gcc/x86_64-linux-gnu/8/include/stdint.h:9, from /usr/include/inttypes.h:27, from ../../../include/linux/raid/pq.h:29, from algos.c:14: /usr/include/x86_64-linux-gnu/bits/stdint-intn.h:27:19: note: previous \ declaration of ‘int64_t’ was here typedef __int64_t int64_t; Fixes: 54d50897d544 ("linux/kernel.h: split *_MAX and *_MIN macros into <linux/limits.h>") Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Song Liu <songliubraving@fb.com>
| | * | md-bitmap: small cleanupsZhiqiang Liu2020-01-131-3/+2
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | In md_bitmap_unplug, bitmap->storage.filemap is double checked. In md_bitmap_daemon_work, bitmap->storage.filemap should be checked before reference. Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>
* | | Merge tag 'for-5.6/block-2020-01-27' of git://git.kernel.dk/linux-blockLinus Torvalds2020-01-2711-57/+131
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull core block updates from Jens Axboe: "This may be the most quiet round we've had in years. I'm not complaining. Really not a lot to detail here, outside of spelling and documentation improvements/fixes, we have: - Allow t10-pi to be modular (Herbert) - Remove dead code in bfq (Alex) - Mark zone management requests with REQ_SYNC (Chaitanya) - BFQ division improvement (Wen) - Small series improving plugging (Pavel)" * tag 'for-5.6/block-2020-01-27' of git://git.kernel.dk/linux-block: partitions/ldm: fix spelling mistake "to" -> "too" block, bfq: improve arithmetic division in bfq_delta() block/bfq: remove unused bfq_class_rt which never used block: mark zone-mgmt bios with REQ_SYNC blk-mq: Document functions for sending request block: Allow t10-pi to be modular blk-mq: optimise blk_mq_flush_plug_list() list: introduce list_for_each_continue() blk-mq: optimise rq sort function
| * | | partitions/ldm: fix spelling mistake "to" -> "too"Colin Ian King2020-01-231-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | There is a spelling mistake in a ldm_error message. Fix it. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | block, bfq: improve arithmetic division in bfq_delta()Wen Yang2020-01-221-4/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | do_div() does a 64-by-32 division. Use div64_ul() instead of it if the divisor is unsigned long, to avoid truncation to 32-bit. And as a nice side effect also cleans up the function a bit. Signed-off-by: Wen Yang <wenyang@linux.alibaba.com> Cc: Paolo Valente <paolo.valente@linaro.org> Cc: Jens Axboe <axboe@fb.com> Cc: linux-block@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | block/bfq: remove unused bfq_class_rt which never usedAlex Shi2020-01-221-1/+0
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This macro is never used after introduced from commit aee69d78dec0 ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler") Better to remove it. Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Cc: Paolo Valente <paolo.valente@linaro.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | block: mark zone-mgmt bios with REQ_SYNCChaitanya Kulkarni2020-01-091-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the current implementation, final zone-mgmt request is issued with submit_bio_wait() which marks the bio REQ_SYNC. This is needed since immediate action is expected for zone-mgmt requests as these are blocking operations. This also bypasses the scheduler in the blk_mq_make_request() and dispatches the request directly into the hw ctx. This patch marks all the chained bios REQ_SYNC so that we can have above-mentioned behavior for non-final bios also. Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | blk-mq: Document functions for sending requestAndré Almeida2020-01-071-2/+83
| | | | | | | | | | | | | | | | | | | | | | | | Add or improve documentation for function regarding creating and sending IO requests to the hardware. Signed-off-by: André Almeida <andrealmeid@collabora.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>