summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* Merge branch 'for-4.5/drivers' of git://git.kernel.dk/linux-blockLinus Torvalds2016-01-2233-1326/+3893
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block driver updates from Jens Axboe: "This is the block driver pull request for 4.5, with the exception of NVMe, which is in a separate branch and will be posted after this one. This pull request contains: - A set of bcache stability fixes, which have been acked by Kent. These have been used and tested for more than a year by the community, so it's about time that they got in. - A set of drbd updates from the drbd team (Andreas, Lars, Philipp) and Markus Elfring, Oleg Drokin. - A set of fixes for xen blkback/front from the usual suspects, (Bob, Konrad) as well as community based fixes from Kiri, Julien, and Peng. - A 2038 time fix for sx8 from Shraddha, with a fix from me. - A small mtip32xx cleanup from Zhu Yanjun. - A null_blk division fix from Arnd" * 'for-4.5/drivers' of git://git.kernel.dk/linux-block: (71 commits) null_blk: use sector_div instead of do_div mtip32xx: restrict variables visible in current code module xen/blkfront: Fix crash if backend doesn't follow the right states. xen/blkback: Fix two memory leaks. xen/blkback: make st_ statistics per ring xen/blkfront: Handle non-indirect grant with 64KB pages xen-blkfront: Introduce blkif_ring_get_request xen-blkback: clear PF_NOFREEZE for xen_blkif_schedule() xen/blkback: Free resources if connect_ring failed. xen/blocks: Return -EXX instead of -1 xen/blkback: make pool of persistent grants and free pages per-queue xen/blkback: get the number of hardware queues/rings from blkfront xen/blkback: pseudo support for multi hardware queues/rings xen/blkback: separate ring information out of struct xen_blkif xen/blkfront: correct setting for xen_blkif_max_ring_order xen/blkfront: make persistent grants pool per-queue xen/blkfront: Remove duplicate setting of ->xbdev. xen/blkfront: Cleanup of comments, fix unaligned variables, and syntax errors. xen/blkfront: negotiate number of queues/rings to be used with backend xen/blkfront: split per device io_lock ...
| * null_blk: use sector_div instead of do_divArnd Bergmann2016-01-131-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Dividing a sector_t number should be done using sector_div rather than do_div to optimize the 32-bit sector_t case, and with the latest do_div optimizations, we now get a compile-time warning for this: arch/arm/include/asm/div64.h:32:95: note: expected 'uint64_t * {aka long long unsigned int *}' but argument is of type 'sector_t * {aka long unsigned int *}' drivers/block/null_blk.c:521:81: warning: comparison of distinct pointer types lacks a cast This changes the newly added code to use sector_div. It is a simplified version of the original patch, as Linus Torvalds pointed out that we should not be using an expensive division function in the first place. This version was suggested by Matias Bjorling. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Matias Bjorling <m@bjorling.me> Fixes: b2b7e00148a2 ("null_blk: register as a LightNVM device") Signed-off-by: Jens Axboe <axboe@fb.com>
| * Merge branch 'stable/for-jens-4.5' of ↵Jens Axboe2016-01-135-710/+1292
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into for-4.5/drivers Konrad writes: The pull is based on converting the backend driver into an multiqueue driver and exposing more than one queue to the frontend. As such we had to modify the frontend and also fix a bunch of bugs around this. The original work is based on Arianna Avanzini's work as an OPW intern. Bob took over the work and had been massaging it for quite some time. Also included are are features to 64KB page support for ARM and various bug-fixes.
| | * xen/blkfront: Fix crash if backend doesn't follow the right states.Konrad Rzeszutek Wilk2016-01-041-4/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We have split the setting up of all the resources in two steps: 1) talk_to_blkback - which figures out the num_ring_pages (from the default value of zero), sets up shadow and so 2) blkfront_connect - does the real part of filling out the internal structures. The problem is if we bypass the 1) step and go straight to 2) and call blkfront_setup_indirect where we use the macro BLK_RING_SIZE - which returns an negative value (because sz is zero - since num_ring_pages is zero - since it has never been set). We can fix this by making sure that we always have called talk_to_blkback before going to blkfront_connect. Or we could set in blkfront_probe info->nr_ring_pages = 1 to have a default value. But that looks odd - as we haven't actually negotiated any ring size. This patch changes XenbusStateConnected state to detect if we haven't done the initial handshake - and if so continue on as if were in XenbusStateInitWait state. We also roll the error recovery (freeing the structure) into talk_to_blkback error path - which is safe since that function is only called from blkback_changed. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| | * xen/blkback: Fix two memory leaks.Bob Liu2016-01-041-6/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch fixs two memleaks: backtrace: [<ffffffff817ba5e8>] kmemleak_alloc+0x28/0x50 [<ffffffff81205e3b>] kmem_cache_alloc+0xbb/0x1d0 [<ffffffff81534028>] xen_blkbk_probe+0x58/0x230 [<ffffffff8146adb6>] xenbus_dev_probe+0x76/0x130 [<ffffffff81511716>] driver_probe_device+0x166/0x2c0 [<ffffffff815119bc>] __device_attach_driver+0xac/0xb0 [<ffffffff8150fa57>] bus_for_each_drv+0x67/0x90 [<ffffffff81511ab7>] __device_attach+0xc7/0x120 [<ffffffff81511b23>] device_initial_probe+0x13/0x20 [<ffffffff8151059a>] bus_probe_device+0x9a/0xb0 [<ffffffff8150f0a1>] device_add+0x3b1/0x5c0 [<ffffffff8150f47e>] device_register+0x1e/0x30 [<ffffffff8146a9e8>] xenbus_probe_node+0x158/0x170 [<ffffffff8146abaf>] xenbus_dev_changed+0x1af/0x1c0 [<ffffffff8146b1bb>] backend_changed+0x1b/0x20 [<ffffffff81468ca6>] xenwatch_thread+0xb6/0x160 unreferenced object 0xffff880007ba8ef8 (size 224): backtrace: [<ffffffff817ba5e8>] kmemleak_alloc+0x28/0x50 [<ffffffff81205c73>] __kmalloc+0xd3/0x1e0 [<ffffffff81534d87>] frontend_changed+0x2c7/0x580 [<ffffffff8146af12>] xenbus_otherend_changed+0xa2/0xb0 [<ffffffff8146b2c0>] frontend_changed+0x10/0x20 [<ffffffff81468ca6>] xenwatch_thread+0xb6/0x160 [<ffffffff810d3e97>] kthread+0xd7/0xf0 [<ffffffff817c4a9f>] ret_from_fork+0x3f/0x70 [<ffffffffffffffff>] 0xffffffffffffffff unreferenced object 0xffff8800048dcd38 (size 224): The first leak is caused by not put() the be->blkif reference which we had gotten in xen_blkif_alloc(), while the second is us not freeing blkif->rings in the right place. Signed-off-by: Bob Liu <bob.liu@oracle.com> Reported-and-Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| | * xen/blkback: make st_ statistics per ringBob Liu2016-01-043-38/+61
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make st_* statistics per ring and the VBD sysfs would iterate over all the rings. Note: xenvbd_sysfs_delif() is called in xen_blkbk_remove() before all rings are torn down, so it's safe. Signed-off-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> --- v2: Aligned the variables on the same column.
| | * xen/blkfront: Handle non-indirect grant with 64KB pagesJulien Grall2016-01-041-16/+212
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The minimal size of request in the block framework is always PAGE_SIZE. It means that when 64KB guest is support, the request will at least be 64KB. Although, if the backend doesn't support indirect descriptor (such as QDISK in QEMU), a ring request is only able to accommodate 11 segments of 4KB (i.e 44KB). The current frontend is assuming that an I/O request will always fit in a ring request. This is not true any more when using 64KB page granularity and will therefore crash during boot. On ARM64, the ABI is completely neutral to the page granularity used by the domU. The guest has the choice between different page granularity supported by the processors (for instance on ARM64: 4KB, 16KB, 64KB). This can't be enforced by the hypervisor and therefore it's possible to run guests using different page granularity. So we can't mandate the block backend to support indirect descriptor when the frontend is using 64KB page granularity and have to fix it properly in the frontend. The solution exposed below is based on modifying directly the frontend guest rather than asking the block framework to support smaller size (i.e < PAGE_SIZE). This is because the change is the block framework are not trivial as everything seems to relying on a struct *page (see [1]). Although, it may be possible that someone succeed to do it in the future and we would therefore be able to use it. Given that a block request may not fit in a single ring request, a second request is introduced for the data that cannot fit in the first one. This means that the second ring request should never be used on Linux if the page size is smaller than 44KB. To achieve the support of the extra ring request, the block queue size is divided by two. Therefore, the ring will always contain enough space to accommodate 2 ring requests. While this will reduce the overall performance, it will make the implementation more contained. The way forward to get better performance is to implement in the backend either indirect descriptor or multiple grants ring. Note that the parameters blk_queue_max_* helpers haven't been updated. The block code will set the mimimum size supported and we may be able to support directly any change in the block framework that lower down the minimal size of a request. [1] http://lists.xen.org/archives/html/xen-devel/2015-08/msg02200.html Signed-off-by: Julien Grall <julien.grall@citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| | * xen-blkfront: Introduce blkif_ring_get_requestJulien Grall2016-01-041-11/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | The code to get a request is always the same. Therefore we can factorize it in a single function. Signed-off-by: Julien Grall <julien.grall@citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| | * xen-blkback: clear PF_NOFREEZE for xen_blkif_schedule()Jiri Kosina2016-01-041-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | xen_blkif_schedule() kthread calls try_to_freeze() at the beginning of every attempt to purge the LRU. This operation can't ever succeed though, as the kthread hasn't marked itself as freezable. Before (hopefully eventually) kthread freezing gets converted to fileystem freezing, we'd rather mark xen_blkif_schedule() freezable (as it can generate I/O during suspend). Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| | * xen/blkback: Free resources if connect_ring failed.Konrad Rzeszutek Wilk2016-01-041-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With the multi-queue support we could fail at setting up some of the rings and fail the connection. That meant that all resources tied to rings[0..n-1] (where n is the ring that failed to be setup). Eventually the frontend will switch to the states and we will call xen_blkif_disconnect. However we do not want to be at the mercy of the frontend deciding when to change states. This allows us to do the cleanup right away and freeing resources. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| | * xen/blocks: Return -EXX instead of -1Konrad Rzeszutek Wilk2016-01-042-3/+3
| | | | | | | | | | | | | | | | | | Lets return sensible values instead of -1. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| | * xen/blkback: make pool of persistent grants and free pages per-queueBob Liu2016-01-043-137/+118
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make pool of persistent grants and free pages per-queue/ring instead of per-device to get better scalability. Test was done based on null_blk driver: dom0: v4.2-rc8 16vcpus 10GB "modprobe null_blk" domu: v4.2-rc8 16vcpus 10GB [test] rw=read direct=1 ioengine=libaio bs=4k time_based runtime=30 filename=/dev/xvdb numjobs=16 iodepth=64 iodepth_batch=64 iodepth_batch_complete=64 group_reporting Results: iops1: After patch "xen/blkfront: make persistent grants per-queue". iops2: After this patch. Queues: 1 4 8 16 Iops orig(k): 810 1064 780 700 Iops1(k): 810 1230(~20%) 1024(~20%) 850(~20%) Iops2(k): 810 1410(~35%) 1354(~75%) 1440(~100%) With 4 queues after this commit we can get ~75% increase in IOPS, and performance won't drop if increasing queue numbers. Please find the respective chart in this link: https://www.dropbox.com/s/agrcy2pbzbsvmwv/iops.png?dl=0 Signed-off-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| | * xen/blkback: get the number of hardware queues/rings from blkfrontBob Liu2016-01-043-6/+42
| | | | | | | | | | | | | | | | | | | | | | | | Backend advertises "multi-queue-max-queues" to front, also get the negotiated number from "multi-queue-num-queues" written by blkfront. Signed-off-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| | * xen/blkback: pseudo support for multi hardware queues/ringsKonrad Rzeszutek Wilk2016-01-042-105/+175
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Preparatory patch for multiple hardware queues (rings). The number of rings is unconditionally set to 1, larger number will be enabled in "xen/blkback: get the number of hardware queues/rings from blkfront". Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> --- v2: Align variables in the structures.
| | * xen/blkback: separate ring information out of struct xen_blkifBob Liu2016-01-043-172/+215
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Split per ring information to an new structure "xen_blkif_ring", so that one vbd device can be associated with one or more rings/hardware queues. Introduce 'pers_gnts_lock' to protect the pool of persistent grants since we may have multi backend threads. This patch is a preparation for supporting multi hardware queues/rings. Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> --- v2: Align the variables in the structure.
| | * xen/blkfront: correct setting for xen_blkif_max_ring_orderPeng Fan2016-01-041-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | According to this piece code: " pr_info("Invalid max_ring_order (%d), will use default max: %d.\n", xen_blkif_max_ring_order, XENBUS_MAX_RING_GRANT_ORDER); " if xen_blkif_max_ring_order is bigger that XENBUS_MAX_RING_GRANT_ORDER, need to set xen_blkif_max_ring_order using XENBUS_MAX_RING_GRANT_ORDER, but not 0. Signed-off-by: Peng Fan <van.freenix@gmail.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: "Roger Pau Monné" <roger.pau@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| | * xen/blkfront: make persistent grants pool per-queueBob Liu2016-01-041-67/+43
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make persistent grants per-queue/ring instead of per-device, so that we can drop the 'dev_lock' and get better scalability. Test was done based on null_blk driver: dom0: v4.2-rc8 16vcpus 10GB "modprobe null_blk" domu: v4.2-rc8 16vcpus 10GB [test] rw=read direct=1 ioengine=libaio bs=4k time_based runtime=30 filename=/dev/xvdb numjobs=16 iodepth=64 iodepth_batch=64 iodepth_batch_complete=64 group_reporting Queues: 1 4 8 16 Iops orig(k): 810 1064 780 700 Iops patched(k): 810 1230(~20%) 1024(~20%) 850(~20%) Signed-off-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| | * xen/blkfront: Remove duplicate setting of ->xbdev.Bob Liu2016-01-041-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | We do the same exact operations a bit earlier in the function. Signed-off-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| | * xen/blkfront: Cleanup of comments, fix unaligned variables, and syntax errors.Konrad Rzeszutek Wilk2016-01-041-7/+6
| | | | | | | | | | | | Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| | * xen/blkfront: negotiate number of queues/rings to be used with backendBob Liu2016-01-041-41/+119
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The max number of hardware queues for xen/blkfront is set by parameter 'max_queues'(default 4), while it is also capped by the max value that the xen/blkback exposes through XenStore key 'multi-queue-max-queues'. The negotiated number is the smaller one and would be written back to xenstore as "multi-queue-num-queues", blkback needs to read this negotiated number. Signed-off-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| | * xen/blkfront: split per device io_lockBob Liu2016-01-041-26/+47
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After patch "xen/blkfront: separate per ring information out of device info", per-ring data is protected by a per-device lock ('io_lock'). This is not a good way and will effect the scalability, so introduce a per-ring lock ('ring_lock'). The old 'io_lock' is renamed to 'dev_lock' which protects the ->grants list and ->persistent_gnts_c which are shared by all rings. Note that in 'blkfront_probe' the 'blkfront_info' is setup via kzalloc so setting ->persistent_gnts_c to zero is not needed. Signed-off-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| | * xen/blkfront: pseudo support for multi hardware queues/ringsBob Liu2016-01-041-145/+198
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Preparatory patch for multiple hardware queues (rings). The number of rings is unconditionally set to 1, larger number will be enabled in patch "xen/blkfront: negotiate number of queues/rings to be used with backend" so as to make review easier. Note that blkfront_gather_backend_features does not call blkfront_setup_indirect anymore (as that needs to be done per ring). That means that in blkif_recover/blkif_connect we have to do it in a loop (bounded by nr_rings). Signed-off-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| | * xen/blkfront: separate per ring information out of device infoBob Liu2016-01-041-162/+197
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Split per ring information to a new structure "blkfront_ring_info". A ring is the representation of a hardware queue, every vbd device can associate with one or more rings depending on how many hardware queues/rings to be used. This patch is a preparation for supporting real multi hardware queues/rings. We also add a backpointer to 'struct blkfront_info' (dev_info) which is not needed (we could use containers_of) but further patch ("xen/blkfront: pseudo support for multi hardware queues/rings") will make allocation of 'blkfront_ring_info' dynamic. Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| | * xen/blkif: document blkif multi-queue/ring extensionBob Liu2016-01-041-0/+48
| | | | | | | | | | | | | | | | | | | | | | | | Document the multi-queue/ring feature in terms of XenStore keys to be written by the backend and by the frontend. Signed-off-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| * | mtip32xx: restrict variables visible in current code moduleZhu Yanjun2016-01-081-3/+3
| |/ | | | | | | | | | | | | | | | | The modified variables are only used in the file mtip32xx.c. As such, the static keyword is inserted to define that object to be only visible to the current code module during compilation. Signed-off-by: Zhu Yanjun <zyjzyj2000@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * bcache: Change refill_dirty() to always scan entire disk if necessaryKent Overstreet2015-12-311-7/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously, it would only scan the entire disk if it was starting from the very start of the disk - i.e. if the previous scan got to the end. This was broken by refill_full_stripes(), which updates last_scanned so that refill_dirty was never triggering the searched_from_start path. But if we change refill_dirty() to always scan the entire disk if necessary, regardless of what last_scanned was, the code gets cleaner and we fix that bug too. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>
| * bcache: prevent crash on changing writeback_runningStefan Bader2015-12-311-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | Added a safeguard in the shutdown case. At least while not being attached it is also possible to trigger a kernel bug by writing into writeback_running. This change adds the same check before trying to wake up the thread for that case. Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>
| * bcache: allows use of register in udev to avoid "device_busy" error.Gabriel de Perthuis2015-12-311-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allows to use register, not register_quiet in udev to avoid "device_busy" error. The initial patch proposed at https://lkml.org/lkml/2013/8/26/549 by Gabriel de Perthuis <g2p.code@gmail.com> does not unlock the mutex and hangs the kernel. See http://thread.gmane.org/gmane.linux.kernel.bcache.devel/2594 for the discussion. Cc: Denis Bychkov <manover@gmail.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Eric Wheeler <bcache@linux.ewheeler.net> Cc: Gabriel de Perthuis <g2p.code@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>
| * bcache: unregister reboot notifier if bcache fails to unregister deviceZheng Liu2015-12-311-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | In bcache_init() function it forgot to unregister reboot notifier if bcache fails to unregister a block device. This commit fixes this. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Tested-by: Joshua Schmid <jschmid@suse.com> Tested-by: Eric Wheeler <bcache@linux.ewheeler.net> Cc: Kent Overstreet <kmo@daterainc.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>
| * bcache: fix a leak in bch_cached_dev_run()Al Viro2015-12-311-1/+4
| | | | | | | | | | | | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Joshua Schmid <jschmid@suse.com> Tested-by: Eric Wheeler <bcache@linux.ewheeler.net> Cc: Kent Overstreet <kmo@daterainc.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>
| * bcache: clear BCACHE_DEV_UNLINK_DONE flag when attaching a backing deviceZheng Liu2015-12-311-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This bug can be reproduced by the following script: #!/bin/bash bcache_sysfs="/sys/fs/bcache" function clear_cache() { if [ ! -e $bcache_sysfs ]; then echo "no bcache sysfs" exit fi cset_uuid=$(ls -l $bcache_sysfs|head -n 2|tail -n 1|awk '{print $9}') sudo sh -c "echo $cset_uuid > /sys/block/sdb/sdb1/bcache/detach" sleep 5 sudo sh -c "echo $cset_uuid > /sys/block/sdb/sdb1/bcache/attach" } for ((i=0;i<10;i++)); do clear_cache done The warning messages look like below: [ 275.948611] ------------[ cut here ]------------ [ 275.963840] WARNING: at fs/sysfs/dir.c:512 sysfs_add_one+0xb8/0xd0() (Tainted: P W --------------- ) [ 275.979253] Hardware name: Tecal RH2285 [ 275.994106] sysfs: cannot create duplicate filename '/devices/pci0000:00/0000:00:09.0/0000:08:00.0/host4/target4:2:1/4:2:1:0/block/sdb/sdb1/bcache/cache' [ 276.024105] Modules linked in: bcache tcp_diag inet_diag ipmi_devintf ipmi_si ipmi_msghandler bonding 8021q garp stp llc ipv6 ext3 jbd loop sg iomemory_vsl(P) bnx2 microcode serio_raw i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support i7core_edac edac_core shpchp ext4 jbd2 mbcache megaraid_sas pata_acpi ata_generic ata_piix dm_mod [last unloaded: scsi_wait_scan] [ 276.072643] Pid: 2765, comm: sh Tainted: P W --------------- 2.6.32 #1 [ 276.089315] Call Trace: [ 276.105801] [<ffffffff81070fe7>] ? warn_slowpath_common+0x87/0xc0 [ 276.122650] [<ffffffff810710d6>] ? warn_slowpath_fmt+0x46/0x50 [ 276.139361] [<ffffffff81205c08>] ? sysfs_add_one+0xb8/0xd0 [ 276.156012] [<ffffffff8120609b>] ? sysfs_do_create_link+0x12b/0x170 [ 276.172682] [<ffffffff81206113>] ? sysfs_create_link+0x13/0x20 [ 276.189282] [<ffffffffa03bda21>] ? bcache_device_link+0xc1/0x110 [bcache] [ 276.205993] [<ffffffffa03bfa08>] ? bch_cached_dev_attach+0x478/0x4f0 [bcache] [ 276.222794] [<ffffffffa03c4a17>] ? bch_cached_dev_store+0x627/0x780 [bcache] [ 276.239680] [<ffffffff8116783a>] ? alloc_pages_current+0xaa/0x110 [ 276.256594] [<ffffffff81203b15>] ? sysfs_write_file+0xe5/0x170 [ 276.273364] [<ffffffff811887b8>] ? vfs_write+0xb8/0x1a0 [ 276.290133] [<ffffffff811890b1>] ? sys_write+0x51/0x90 [ 276.306368] [<ffffffff8100c072>] ? system_call_fastpath+0x16/0x1b [ 276.322301] ---[ end trace 9f5d4fcdd0c3edfb ]--- [ 276.338241] ------------[ cut here ]------------ [ 276.354109] WARNING: at /home/wenqing.lz/bcache/bcache/super.c:720 bcache_device_link+0xdf/0x110 [bcache]() (Tainted: P W --------------- ) [ 276.386017] Hardware name: Tecal RH2285 [ 276.401430] Couldn't create device <-> cache set symlinks [ 276.401759] Modules linked in: bcache tcp_diag inet_diag ipmi_devintf ipmi_si ipmi_msghandler bonding 8021q garp stp llc ipv6 ext3 jbd loop sg iomemory_vsl(P) bnx2 microcode serio_raw i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support i7core_edac edac_core shpchp ext4 jbd2 mbcache megaraid_sas pata_acpi ata_generic ata_piix dm_mod [last unloaded: scsi_wait_scan] [ 276.465477] Pid: 2765, comm: sh Tainted: P W --------------- 2.6.32 #1 [ 276.482169] Call Trace: [ 276.498610] [<ffffffff81070fe7>] ? warn_slowpath_common+0x87/0xc0 [ 276.515405] [<ffffffff810710d6>] ? warn_slowpath_fmt+0x46/0x50 [ 276.532059] [<ffffffffa03bda3f>] ? bcache_device_link+0xdf/0x110 [bcache] [ 276.548808] [<ffffffffa03bfa08>] ? bch_cached_dev_attach+0x478/0x4f0 [bcache] [ 276.565569] [<ffffffffa03c4a17>] ? bch_cached_dev_store+0x627/0x780 [bcache] [ 276.582418] [<ffffffff8116783a>] ? alloc_pages_current+0xaa/0x110 [ 276.599341] [<ffffffff81203b15>] ? sysfs_write_file+0xe5/0x170 [ 276.616142] [<ffffffff811887b8>] ? vfs_write+0xb8/0x1a0 [ 276.632607] [<ffffffff811890b1>] ? sys_write+0x51/0x90 [ 276.648671] [<ffffffff8100c072>] ? system_call_fastpath+0x16/0x1b [ 276.664756] ---[ end trace 9f5d4fcdd0c3edfc ]--- We forget to clear BCACHE_DEV_UNLINK_DONE flag in bcache_device_attach() function when we attach a backing device first time. After detaching this backing device, this flag will be true and sysfs_remove_link() isn't called in bcache_device_unlink(). Then when we attach this backing device again, sysfs_create_link() will return EEXIST error in bcache_device_link(). So the fix is trival and we clear this flag in bcache_device_link(). Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Tested-by: Joshua Schmid <jschmid@suse.com> Tested-by: Eric Wheeler <bcache@linux.ewheeler.net> Cc: Kent Overstreet <kmo@daterainc.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>
| * bcache: Add a cond_resched() call to gcKent Overstreet2015-12-311-0/+1
| | | | | | | | | | | | | | | | Signed-off-by: Takashi Iwai <tiwai@suse.de> Tested-by: Eric Wheeler <bcache@linux.ewheeler.net> Cc: Kent Overstreet <kmo@daterainc.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>
| * bcache: fix a livelock when we cause a huge number of cache missesZheng Liu2015-12-311-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Subject : [PATCH v2] bcache: fix a livelock in btree lock Date : Wed, 25 Feb 2015 20:32:09 +0800 (02/25/2015 04:32:09 AM) This commit tries to fix a livelock in bcache. This livelock might happen when we causes a huge number of cache misses simultaneously. When we get a cache miss, bcache will execute the following path. ->cached_dev_make_request() ->cached_dev_read() ->cached_lookup() ->bch->btree_map_keys() ->btree_root() <------------------------ ->bch_btree_map_keys_recurse() | ->cache_lookup_fn() | ->cached_dev_cache_miss() | ->bch_btree_insert_check_key() -| [If btree->seq is not equal to seq + 1, we should return EINTR and traverse btree again.] In bch_btree_insert_check_key() function we first need to check upgrade flag (op->lock == -1), and when this flag is true we need to release read btree->lock and try to take write btree->lock. During taking and releasing this write lock, btree->seq will be monotone increased in order to prevent other threads modify this in cache miss (see btree.h:74). But if there are some cache misses caused by some requested, we could meet a livelock because btree->seq is always changed by others. Thus no one can make progress. This commit will try to take write btree->lock if it encounters a race when we traverse btree. Although it sacrifice the scalability but we can ensure that only one can modify the btree. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Tested-by: Joshua Schmid <jschmid@suse.com> Tested-by: Eric Wheeler <bcache@linux.ewheeler.net> Cc: Joshua Schmid <jschmid@suse.com> Cc: Zhu Yanhai <zhu.yanhai@gmail.com> Cc: Kent Overstreet <kmo@daterainc.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>
| * sx8: use real time for the command secondsJens Axboe2015-12-231-1/+1
| | | | | | | | | | | | | | | | | | | | Commit 8182503df1ba used monotonic time, but if the adapter is using the seconds for logging entries, then we'll get duplicate entries if the system is rebooted. Use real time instead. Reported-by: Arnd Bergmann <arnd@arndb.de> Fixes: 8182503df1ba ("block: sx8.c: Replace timeval with ktime_t") Signed-off-by: Jens Axboe <axboe@fb.com>
| * block: sx8.c: Replace timeval with ktime_tShraddha Barke2015-12-221-4/+3
| | | | | | | | | | | | | | | | | | | | | | | | 32-bit systems using 'struct timeval' will break in the year 2038, in order to avoid that replace the code with more appropriate types. This patch replaces timeval with 64 bit ktime_t which is y2038 safe. Since st->timestamp is only interested in seconds, directly using time64_t here. Function ktime_get_seconds is used since it uses monotonic instead of real time and thus will not cause overflow. Signed-off-by: Shraddha Barke <shraddha.6596@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * drbd: fix error path during resizeLars Ellenberg2015-11-251-30/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In case the lower level device size changed, but some other internal details of the resize did not work out, drbd_determine_dev_size() would try to restore the previous settings, trusting drbd_md_set_sector_offsets() to "do the right thing", but overlooked that this internally may set the meta data base offset based on device size. This could end up with incomplete on-disk meta data layout change, and ultimately lead to data corruption (if the failure was not noticed or ignored by the operator, and other things go wrong as well). Just remember all meta data related offsets/sizes, and on error restore them all. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * drbd: avoid potential deadlock during handshakeLars Ellenberg2015-11-253-23/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During handshake communication, we also reconsider our device size, using drbd_determine_dev_size(). Just in case we need to change the offsets or layout of our on-disk metadata, we lock out application and other meta data IO, and wait for the activity log to be "idle" (no more referenced extents). If this handshake happens just after a connection loss, with a fencing policy of "resource-and-stonith", we have frozen IO. If, additionally, the activity log was "starving" (too many incoming random writes at that point in time), it won't become idle, ever, because of the frozen IO, and this would be a lockup of the receiver thread, and consquentially of DRBD. Previous logic (re-)initialized with a special "empty" transaction block, which required the activity log to fully drain first. Instead, write out some standard activity log transactions. Using lc_try_lock_for_transaction() instead of lc_try_lock() does not care about pending activity log references, avoiding the potential deadlock. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * drbd: separate out __al_write_transaction helper functionLars Ellenberg2015-11-251-148/+156
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To be able to "force out" an activity log transaction, even if there are no pending updates. This will be used to relocate the on-disk activity log, if the on-disk offsets have to be changed, without the need to empty the activity log first. While at it, move the definition, so we can drop the forward declaration of a static helper. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * drbd: make suspend_io() / resume_io() must be thread and recursion safePhilipp Reisner2015-11-253-6/+8
| | | | | | | | | | | | | | | | | | Avoid to prematurely resume application IO: don't set/clear a single bit, but inc/dec an atomic counter. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * drbd: fix "endless" transfer log walk in protocol ALars Ellenberg2015-11-251-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Don't remember a DRBD request as ack_pending, if it is not. In protocol A, we usually clear RQ_NET_PENDING at the same time we set RQ_NET_SENT, so when deciding to remember it as ack_pending, mod_rq_state needs to look at the current request state, not at the previous state before the current modification was applied. This should prevent advance_conn_req_ack_pending() from walking the full transfer log just to find NULL in protocol A, which would cause serious performance degradation with many "in-flight" requests, e.g. when working via DRBD-proxy, or with a huge bandwidth-delay product. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * drbd: fix memory leak in drbd_adm_resizeOleg Drokin2015-11-251-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | new_disk_conf could be leaked if the follow on checks fail, so make sure to free it on error if it was not assigned yet. Found with smatch. Signed-off-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * drbd: don't block forever in disconnect during resync if fencing=r-a-stonithLars Ellenberg2015-11-251-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Disconnect should wait for pending bitmap IO. But if that bitmap IO is not happening, because it is waiting for pending application IO, and there is no progress, because the fencing policy suspended application IO because of the disconnect, then we deadlock. The bitmap writeout in this case does not care for concurrent application IO, so there is no point waiting for it. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * lru_cache: Converted lc_seq_printf_status to return voidRoland Kammerer2015-11-252-4/+2
| | | | | | | | | | | | | | | | | | Fix the semantic of lc_seq_printf. Currently, it always returns 0 and the return value is unused, therefore, convert the return type to void. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * drbd: make drbd known to lsblk: use bd_link_disk_holderLars Ellenberg2015-11-255-52/+91
| | | | | | | | | | | | | | | | | | | | | | | | | | | | lsblk should be able to pick up stacking device driver relations involving DRBD conveniently. Even though upstream kernel since 2011 says "DON'T USE THIS UNLESS YOU'RE ALREADY USING IT." a new user has been added since (bcache), which sets the precedences for us to use it as well. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * drbd: fix queue limit setup for discardLars Ellenberg2015-11-251-9/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We cannot possibly support SECDISCARD, even if all backend devices would support it: if our peer is currently unreachable, some instance of the data may obviously still be recoverable. We did not set discard_granularity at all. We don't really care (yet), we only pass them on, so for now, set our granularity to one sector. blkdev_stack_limits() takes care of the rest. If we decide we cannot support discards, not only clear the (not user visible) QUEUE_FLAG_DISCARD, but set both (user visible) discard_granularity and max_discard_sectors to zero, to avoid confusion with e.g. lsblk -D. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * drbd: fix spurious alert level printkLars Ellenberg2015-11-251-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | When accessing out meta data area on disk, we double check the plausibility of the requested sector offsets, and are very noisy about it if they look suspicious. During initial read of our "superblock", for "external" meta data, this triggered because the range estimate returned by drbd_md_last_sector() was still wrong. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * drbd: use bitmap_weight() helper, don't open codeLars Ellenberg2015-11-251-8/+8
| | | | | | | | | | | | | | | | Suggested by Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * drbd: avoid redefinition of BITS_PER_PAGELars Ellenberg2015-11-251-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | Apparently we now implicitly get definitions for BITS_PER_PAGE and BITS_PER_PAGE_MASK from the pid_namespace.h Instead of renaming our defines, I chose to define only if not yet defined, but to double check the value if already defined. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * drbd: use resource name in workqueueLars Ellenberg2015-11-252-3/+6
| | | | | | | | | | | | | | | | | | Since kernel 3.3, we can use snprintf-style arguments to create a workqueue. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * drbd: debugfs: expose ed_data_gen_idLars Ellenberg2015-11-252-0/+11
| | | | | | | | | | | | | | | | | | The effective data generation ID may be interesting for debugging purposes of scenarios involving diskless states. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>