summaryrefslogtreecommitdiffstats
path: root/drivers/block/rbd.c (follow)
Commit message (Collapse)AuthorAgeFilesLines
* rbd: require stable pages if message data CRCs are enabledRonny Hegewald2015-10-301-0/+3
| | | | | | | | | | | | | | | | | | | | | | rbd requires stable pages, as it performs a crc of the page data before they are send to the OSDs. But since kernel 3.9 (patch 1d1d1a767206fbe5d4c69493b7e6d2a8d08cc0a0 "mm: only enforce stable page writes if the backing device requires it") it is not assumed anymore that block devices require stable pages. This patch sets the necessary flag to get stable pages back for rbd. In a ceph installation that provides multiple ext4 formatted rbd devices "bad crc" messages appeared regularly (ca 1 message every 1-2 minutes on every OSD that provided the data for the rbd) in the OSD-logs before this patch. After this patch this messages are pretty much gone (only ca 1-2 / month / OSD). Cc: stable@vger.kernel.org # 3.9+, needs backporting Signed-off-by: Ronny Hegewald <Ronny.Hegewald@online.de> [idryomov@gmail.com: require stable pages only in crc case, changelog] Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* rbd: prevent kernel stack blow up on rbd mapIlya Dryomov2015-10-231-10/+23
| | | | | | | | | | | | | | | | | | | | | | Mapping an image with a long parent chain (e.g. image foo, whose parent is bar, whose parent is baz, etc) currently leads to a kernel stack overflow, due to the following recursion in the reply path: rbd_osd_req_callback() rbd_obj_request_complete() rbd_img_obj_callback() rbd_img_parent_read_callback() rbd_obj_request_complete() ... Limit the parent chain to 16 images, which is ~5K worth of stack. When the above recursion is eliminated, this limit can be lifted. Fixes: http://tracker.ceph.com/issues/12538 Cc: stable@vger.kernel.org # 3.10+, needs backporting for < 4.2 Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Josh Durgin <jdurgin@redhat.com>
* rbd: don't leak parent_spec in rbd_dev_probe_parent()Ilya Dryomov2015-10-231-20/+16
| | | | | | | | | | | | | | | | Currently we leak parent_spec and trigger a "parent reference underflow" warning if rbd_dev_create() in rbd_dev_probe_parent() fails. The problem is we take the !parent out_err branch and that only drops refcounts; parent_spec that would've been freed had we called rbd_dev_unparent() remains and triggers rbd_warn() in rbd_dev_parent_put() - at that point we have parent_spec != NULL and parent_ref == 0, so counter ends up being -1 after the decrement. Redo rbd_dev_probe_parent() to fix this. Cc: stable@vger.kernel.org # 3.10+, needs backporting for < 4.2 Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>
* rbd: use writefull op for object size writesIlya Dryomov2015-10-161-2/+7
| | | | | | | | | | | | | | | This covers only the simplest case - an object size sized write, but it's still useful in tiering setups when EC is used for the base tier as writefull op can be proxied, saving an object promotion. Even though updating ceph_osdc_new_request() to allow writefull should just be a matter of fixing an assert, I didn't do it because its only user is cephfs. All other sites were updated. Reflects ceph.git commit 7bfb7f9025a8ee0d2305f49bf0336d2424da5b5b. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>
* rbd: set max_sectors explicitlyIlya Dryomov2015-10-161-0/+1
| | | | | | | | | | | | | Commit 30e2bc08b2bb ("Revert "block: remove artifical max_hw_sectors cap"") restored a clamp on max_sectors. It's now 2560 sectors instead of 1024, but it's not good enough: we set max_hw_sectors to rbd object size because we don't want object sized I/Os to be split, and the default object size is 4M. So, set max_sectors to max_hw_sectors in rbd at queue init time. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>
* Merge branch 'for-linus' of ↵Linus Torvalds2015-09-111-2/+4
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph update from Sage Weil: "There are a few fixes for snapshot behavior with CephFS and support for the new keepalive protocol from Zheng, a libceph fix that affects both RBD and CephFS, a few bug fixes and cleanups for RBD from Ilya, and several small fixes and cleanups from Jianpeng and others" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: ceph: improve readahead for file holes ceph: get inode size for each append write libceph: check data_len in ->alloc_msg() libceph: use keepalive2 to verify the mon session is alive rbd: plug rbd_dev->header.object_prefix memory leak rbd: fix double free on rbd_dev->header_name libceph: set 'exists' flag for newly up osd ceph: cleanup use of ceph_msg_get ceph: no need to get parent inode in ceph_open ceph: remove the useless judgement ceph: remove redundant test of head->safe and silence static analysis warnings ceph: fix queuing inode to mdsdir's snaprealm libceph: rename con_work() to ceph_con_workfn() libceph: Avoid holding the zero page on ceph_msgr_slab_init errors libceph: remove the unused macro AES_KEY_SIZE ceph: invalidate dirty pages after forced umount ceph: EIO all operations after forced umount
| * rbd: plug rbd_dev->header.object_prefix memory leakIlya Dryomov2015-09-081-1/+4
| | | | | | | | | | | | | | | | Need to free object_prefix when rbd_dev_v2_snap_context() fails, but only if this is the first time we are reading in the header. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>
| * rbd: fix double free on rbd_dev->header_nameIlya Dryomov2015-09-081-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | If rbd_dev_image_probe() in rbd_dev_probe_parent() fails, header_name is freed twice: once in rbd_dev_probe_parent() and then in its caller rbd_dev_image_probe() (rbd_dev_image_probe() is called recursively to handle parent images). rbd_dev_probe_parent() is responsible for probing the parent, so it shouldn't muck with clone's fields. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>
* | Merge branch 'for-4.3/core' of git://git.kernel.dk/linux-blockLinus Torvalds2015-09-021-48/+1
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull core block updates from Jens Axboe: "This first core part of the block IO changes contains: - Cleanup of the bio IO error signaling from Christoph. We used to rely on the uptodate bit and passing around of an error, now we store the error in the bio itself. - Improvement of the above from myself, by shrinking the bio size down again to fit in two cachelines on x86-64. - Revert of the max_hw_sectors cap removal from a revision again, from Jeff Moyer. This caused performance regressions in various tests. Reinstate the limit, bump it to a more reasonable size instead. - Make /sys/block/<dev>/queue/discard_max_bytes writeable, by me. Most devices have huge trim limits, which can cause nasty latencies when deleting files. Enable the admin to configure the size down. We will look into having a more sane default instead of UINT_MAX sectors. - Improvement of the SGP gaps logic from Keith Busch. - Enable the block core to handle arbitrarily sized bios, which enables a nice simplification of bio_add_page() (which is an IO hot path). From Kent. - Improvements to the partition io stats accounting, making it faster. From Ming Lei. - Also from Ming Lei, a basic fixup for overflow of the sysfs pending file in blk-mq, as well as a fix for a blk-mq timeout race condition. - Ming Lin has been carrying Kents above mentioned patches forward for a while, and testing them. Ming also did a few fixes around that. - Sasha Levin found and fixed a use-after-free problem introduced by the bio->bi_error changes from Christoph. - Small blk cgroup cleanup from Viresh Kumar" * 'for-4.3/core' of git://git.kernel.dk/linux-block: (26 commits) blk: Fix bio_io_vec index when checking bvec gaps block: Replace SG_GAPS with new queue limits mask block: bump BLK_DEF_MAX_SECTORS to 2560 Revert "block: remove artifical max_hw_sectors cap" blk-mq: fix race between timeout and freeing request blk-mq: fix buffer overflow when reading sysfs file of 'pending' Documentation: update notes in biovecs about arbitrarily sized bios block: remove bio_get_nr_vecs() fs: use helper bio_add_page() instead of open coding on bi_io_vec block: kill merge_bvec_fn() completely md/raid5: get rid of bio_fits_rdev() md/raid5: split bio for chunk_aligned_read block: remove split code in blkdev_issue_{discard,write_same} btrfs: remove bio splitting and merge_bvec_fn() calls bcache: remove driver private bio splitting code block: simplify bio_add_page() block: make generic_make_request handle arbitrarily sized bios blk-cgroup: Drop unlikely before IS_ERR(_OR_NULL) block: don't access bio->bi_error after bio_put() block: shrink struct bio down to 2 cache lines again ...
| * block: kill merge_bvec_fn() completelyKent Overstreet2015-08-131-47/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As generic_make_request() is now able to handle arbitrarily sized bios, it's no longer necessary for each individual block driver to define its own ->merge_bvec_fn() callback. Remove every invocation completely. Cc: Jens Axboe <axboe@kernel.dk> Cc: Lars Ellenberg <drbd-dev@lists.linbit.com> Cc: drbd-user@lists.linbit.com Cc: Jiri Kosina <jkosina@suse.cz> Cc: Yehuda Sadeh <yehuda@inktank.com> Cc: Sage Weil <sage@inktank.com> Cc: Alex Elder <elder@kernel.org> Cc: ceph-devel@vger.kernel.org Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Cc: Neil Brown <neilb@suse.de> Cc: linux-raid@vger.kernel.org Cc: Christoph Hellwig <hch@infradead.org> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Acked-by: NeilBrown <neilb@suse.de> (for the 'md' bits) Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> [dpark: also remove ->merge_bvec_fn() in dm-thin as well as dm-era-target, and resolve merge conflicts] Signed-off-by: Dongsu Park <dpark@posteo.net> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * block: have drivers use blk_queue_max_discard_sectors()Jens Axboe2015-07-171-1/+1
| | | | | | | | | | | | | | | | | | | | Some drivers use it now, others just set the limits field manually. But in preparation for splitting this into a hard and soft limit, ensure that they all call the proper function for setting the hw limit for discards. Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* | rbd: fix copyup completion raceIlya Dryomov2015-07-311-5/+17
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For write/discard obj_requests that involved a copyup method call, the opcode of the first op is CEPH_OSD_OP_CALL and the ->callback is rbd_img_obj_copyup_callback(). The latter frees copyup pages, sets ->xferred and delegates to rbd_img_obj_callback(), the "normal" image object callback, for reporting to block layer and putting refs. rbd_osd_req_callback() however treats CEPH_OSD_OP_CALL as a trivial op, which means obj_request is marked done in rbd_osd_trivial_callback(), *before* ->callback is invoked and rbd_img_obj_copyup_callback() has a chance to run. Marking obj_request done essentially means giving rbd_img_obj_callback() a license to end it at any moment, so if another obj_request from the same img_request is being completed concurrently, rbd_img_obj_end_request() may very well be called on such prematurally marked done request: <obj_request-1/2 reply> handle_reply() rbd_osd_req_callback() rbd_osd_trivial_callback() rbd_obj_request_complete() rbd_img_obj_copyup_callback() rbd_img_obj_callback() <obj_request-2/2 reply> handle_reply() rbd_osd_req_callback() rbd_osd_trivial_callback() for_each_obj_request(obj_request->img_request) { rbd_img_obj_end_request(obj_request-1/2) rbd_img_obj_end_request(obj_request-2/2) <-- } Calling rbd_img_obj_end_request() on such a request leads to trouble, in particular because its ->xfferred is 0. We report 0 to the block layer with blk_update_request(), get back 1 for "this request has more data in flight" and then trip on rbd_assert(more ^ (which == img_request->obj_request_count)); with rhs (which == ...) being 1 because rbd_img_obj_end_request() has been called for both requests and lhs (more) being 1 because we haven't got a chance to set ->xfferred in rbd_img_obj_copyup_callback() yet. To fix this, leverage that rbd wants to call class methods in only two cases: one is a generic method call wrapper (obj_request is standalone) and the other is a copyup (obj_request is part of an img_request). So make a dedicated handler for CEPH_OSD_OP_CALL and directly invoke rbd_img_obj_copyup_callback() from it if obj_request is part of an img_request, similar to how CEPH_OSD_OP_READ handler invokes rbd_img_obj_request_read_callback(). Since rbd_img_obj_copyup_callback() is now being called from the OSD request callback (only), it is renamed to rbd_osd_copyup_callback(). Cc: Alex Elder <elder@linaro.org> Cc: stable@vger.kernel.org # 3.10+, needs backporting for < 3.18 Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>
* rbd: use GFP_NOIO in rbd_obj_request_create()Ilya Dryomov2015-06-301-2/+2
| | | | | | | | | | | | | | | rbd_obj_request_create() is called on the main I/O path, so we need to use GFP_NOIO to make sure allocation doesn't blow back on us. Not all callers need this, but I'm still hardcoding the flag inside rather than making it a parameter because a) this is going to stable, and b) those callers shouldn't really use rbd_obj_request_create() and will be fixed in the future. More memory allocation fixes will follow. Cc: stable@vger.kernel.org # 3.10+ Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>
* rbd: queue_depth map optionIlya Dryomov2015-06-251-3/+14
| | | | | | | | | | | | | | | | | nr_requests (/sys/block/rbd<id>/queue/nr_requests) is pretty much irrelevant in blk-mq case because each driver sets its own max depth that it can handle and that's the number of tags that gets preallocated on setup. Users can't increase queue depth beyond that value via writing to nr_requests. For rbd we are happy with the default BLKDEV_MAX_RQ (128) for most cases but we want to give users the opportunity to increase it. Introduce a new per-device queue_depth option to do just that: $ sudo rbd map -o queue_depth=1024 ... Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>
* rbd: store rbd_options in rbd_deviceIlya Dryomov2015-06-251-7/+11
| | | | | Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>
* rbd: terminate rbd_opts_tokens with Opt_errIlya Dryomov2015-06-251-16/+8
| | | | | | | Also nuke useless Opt_last_bool and don't break lines unnecessarily. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>
* rbd: bump queue_max_segmentsIlya Dryomov2015-06-251-0/+1
| | | | | | | | | | | | | | | | | | | | | | | The default queue_limits::max_segments value (BLK_MAX_SEGMENTS = 128) unnecessarily limits bio sizes to 512k (assuming 4k pages). rbd, being a virtual block device, doesn't have any restrictions on the number of physical segments, so bump max_segments to max_hw_sectors, in theory allowing a sector per segment (although the only case this matters that I can think of is some readv/writev style thing). In practice this is going to give us 1M bios - the number of segments in a bio is limited in bio_get_nr_vecs() by BIO_MAX_PAGES = 256. Note that this doesn't result in any improvement on a typical direct sequential test. This is because on a box with a not too badly fragmented memory the default BLK_MAX_SEGMENTS is enough to see nice rbd object size sized requests. The only difference is the size of bios being merged - 512k vs 1M for something like $ dd if=/dev/zero of=/dev/rbd0 oflag=direct bs=$RBD_OBJ_SIZE $ dd if=/dev/rbd0 iflag=direct of=/dev/null bs=$RBD_OBJ_SIZE Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>
* rbd: timeout watch teardown on unmap with mount_timeoutIlya Dryomov2015-06-251-10/+28
| | | | | | | | | | | | | | | | | | As part of unmap sequence, kernel client has to talk to the OSDs to teardown watch on the header object. If none of the OSDs are available it would hang forever, until interrupted by a signal - when that happens we follow through with the rest of unmap procedure (i.e. unregister the device and put all the data structures) and the unmap is still considired successful (rbd cli tool exits with 0). The watch on the userspace side should eventually timeout so that's fine. This isn't very nice, because various userspace tools (pacemaker rbd resource agent, for example) then have to worry about setting up their own timeouts. Timeout it with mount_timeout (60 seconds by default). Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org> Reviewed-by: Sage Weil <sage@redhat.com>
* libceph: store timeouts in jiffies, verify user inputIlya Dryomov2015-06-251-2/+3
| | | | | | | | | | | | | | | | | | | | | | There are currently three libceph-level timeouts that the user can specify on mount: mount_timeout, osd_idle_ttl and osdkeepalive. All of these are in seconds and no checking is done on user input: negative values are accepted, we multiply them all by HZ which may or may not overflow, arbitrarily large jiffies then get added together, etc. There is also a bug in the way mount_timeout=0 is handled. It's supposed to mean "infinite timeout", but that's not how wait.h APIs treat it and so __ceph_open_session() for example will busy loop without much chance of being interrupted if none of ceph-mons are there. Fix all this by verifying user input, storing timeouts capped by msecs_to_jiffies() in jiffies and using the new ceph_timeout_jiffies() helper for all user-specified waits to handle infinite timeouts correctly. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>
* libceph: allow setting osd_req_op's flagsYan, Zheng2015-06-251-2/+2
| | | | | Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Alex Elder <elder@linaro.org>
* rbd: end I/O the entire obj_request on errorIlya Dryomov2015-05-021-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | When we end I/O struct request with error, we need to pass obj_request->length as @nr_bytes so that the entire obj_request worth of bytes is completed. Otherwise block layer ends up confused and we trip on rbd_assert(more ^ (which == img_request->obj_request_count)); in rbd_img_obj_callback() due to more being true no matter what. We already do it in most cases but we are missing some, in particular those where we don't even get a chance to submit any obj_requests, due to an early -ENOMEM for example. A number of obj_request->xferred assignments seem to be redundant but I haven't touched any of obj_request->xferred stuff to keep this small and isolated. Cc: Alex Elder <elder@linaro.org> Cc: stable@vger.kernel.org # 3.10+ Reported-by: Shawn Edwards <lesser.evil@gmail.com> Reviewed-by: Sage Weil <sage@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* rbd: rbd_wq comment is obsoleteIlya Dryomov2015-04-221-1/+1
| | | | | | After the switch to blk-mq rbd_wq processes requests, not devices. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* rbd: mark block queue as non-rotationalIlya Dryomov2015-04-201-2/+2
| | | | | | | | | | | | Set QUEUE_FLAG_NONROT. Following commit b277da0a8a59 ("block: disable entropy contributions for nonrot devices") we should also clear QUEUE_FLAG_ADD_RANDOM, but it's off by default for blk-mq drivers, so just note it in the comment. Also remove physical block size assignment - no sense in repeating defaults that are not going to change. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* rbd: be more informative on -ENOENT failuresIlya Dryomov2015-04-201-3/+17
| | | | | | pr_info what exactly was the culprit: missing pool, image or snap. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* rbd: convert to blk-mqChristoph Hellwig2015-02-191-54/+68
| | | | | | | | | | | | | | This converts the rbd driver to use the blk-mq infrastructure. Except for switching to a per-request work item this is almost mechanical. This was tested by Alexandre DERUMIER in November, and found to give him 120000 iops, although the only comparism available was an old 3.10 kernel which gave 80000iops. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <elder@linaro.org> [idryomov@gmail.com: context, blk_mq_init_queue() EH] Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* rbd: do not treat standalone as flattenIlya Dryomov2015-02-191-20/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the clone is resized down to 0, it becomes standalone. If such resize is carried over while an image is mapped we would detect this and call rbd_dev_parent_put() which means "let go of all parent state, including the spec(s) of parent images(s)". This leads to a mismatch between "rbd info" and sysfs parent fields, so a fix is in order. # rbd create --image-format 2 --size 1 foo # rbd snap create foo@snap # rbd snap protect foo@snap # rbd clone foo@snap bar # DEV=$(rbd map bar) # rbd resize --allow-shrink --size 0 bar # rbd resize --size 1 bar # rbd info bar | grep parent parent: rbd/foo@snap Before: # cat /sys/bus/rbd/devices/0/parent (no parent image) After: # cat /sys/bus/rbd/devices/0/parent pool_id 0 pool_name rbd image_id 10056b8b4567 image_name foo snap_id 2 snap_name snap overlap 0 Signed-off-by: Ilya Dryomov <idryomov@redhat.com> Reviewed-by: Josh Durgin <jdurgin@redhat.com> Reviewed-by: Alex Elder <elder@linaro.org>
* rbd: fix error paths in rbd_dev_refresh()Ilya Dryomov2015-02-191-7/+6
| | | | | | | header_rwsem should be released on errors. Also remove useless rbd_dev->mapping.size != rbd_dev->header.image_size test. Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
* rbd: nuke copy_token()Rickard Strandqvist2015-02-191-30/+0
| | | | | | | | | It's been largely superseded by dup_token() and unused for over 2 years, identified by cppcheck. Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se> [idryomov@redhat.com: changelog] Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
* rbd: drop parent_ref in rbd_dev_unprobe() unconditionallyIlya Dryomov2015-01-281-4/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This effectively reverts the last hunk of 392a9dad7e77 ("rbd: detect when clone image is flattened"). The problem with parent_overlap != 0 condition is that it's possible and completely valid to have an image with parent_overlap == 0 whose parent state needs to be cleaned up on unmap. The next commit, which drops the "clone image now standalone" logic, opens up another window of opportunity to hit this, but even without it # cat parent-ref.sh #!/bin/bash rbd create --image-format 2 --size 1 foo rbd snap create foo@snap rbd snap protect foo@snap rbd clone foo@snap bar rbd resize --allow-shrink --size 0 bar rbd resize --size 1 bar DEV=$(rbd map bar) rbd unmap $DEV leaves rbd_device/rbd_spec/etc and rbd_client along with ceph_client hanging around. My thinking behind calling rbd_dev_parent_put() unconditionally is that there shouldn't be any requests in flight at that point in time as we are deep into unmap sequence. Hence, even if rbd_dev_unparent() caused by flatten is delayed by in-flight requests, it will have finished by the time we reach rbd_dev_unprobe() caused by unmap, thus turning unconditional rbd_dev_parent_put() into a no-op. Fixes: http://tracker.ceph.com/issues/10352 Cc: stable@vger.kernel.org # 3.11+ Signed-off-by: Ilya Dryomov <idryomov@redhat.com> Reviewed-by: Josh Durgin <jdurgin@redhat.com> Reviewed-by: Alex Elder <elder@linaro.org>
* rbd: fix rbd_dev_parent_get() when parent_overlap == 0Ilya Dryomov2015-01-281-14/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | The comment for rbd_dev_parent_get() said * We must get the reference before checking for the overlap to * coordinate properly with zeroing the parent overlap in * rbd_dev_v2_parent_info() when an image gets flattened. We * drop it again if there is no overlap. but the "drop it again if there is no overlap" part was missing from the implementation. This lead to absurd parent_ref values for images with parent_overlap == 0, as parent_ref was incremented for each img_request and virtually never decremented. Fix this by leveraging the fact that refresh path calls rbd_dev_v2_parent_info() under header_rwsem and use it for read in rbd_dev_parent_get(), instead of messing around with atomics. Get rid of barriers in rbd_dev_v2_parent_info() while at it - I don't see what they'd pair with now and I suspect we are in a pretty miserable situation as far as proper locking goes regardless. Cc: stable@vger.kernel.org # 3.11+ Signed-off-by: Ilya Dryomov <idryomov@redhat.com> Reviewed-by: Josh Durgin <jdurgin@redhat.com> Reviewed-by: Alex Elder <elder@linaro.org>
* rbd: don't treat CEPH_OSD_OP_DELETE as extent opIlya Dryomov2014-12-171-2/+6
| | | | | | | | | | CEPH_OSD_OP_DELETE is not an extent op, stop treating it as such. This sneaked in with discard patches - it's one of the three osd ops (the other two are CEPH_OSD_OP_TRUNCATE and CEPH_OSD_OP_ZERO) that discard is implemented with. Signed-off-by: Ilya Dryomov <idryomov@redhat.com> Reviewed-by: Alex Elder <elder@linaro.org>
* ceph, rbd: delete unnecessary checks before two function callsSF Markus Elfring2014-12-171-2/+1
| | | | | | | | | | | | The functions ceph_put_snap_context() and iput() test whether their argument is NULL and then return immediately. Thus the test around the call is not needed. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> [idryomov@redhat.com: squashed rbd.c hunk, changelog] Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
* rbd: Fix error recovery in rbd_obj_read_sync()Jan Kara2014-10-301-1/+1
| | | | | | | | | | | | | | When we fail to allocate page vector in rbd_obj_read_sync() we just basically ignore the problem and continue which will result in an oops later. Fix the problem by returning proper error. CC: Yehuda Sadeh <yehuda@inktank.com> CC: Sage Weil <sage@inktank.com> CC: ceph-devel@vger.kernel.org CC: stable@vger.kernel.org Coverity-id: 1226882 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
* rbd: use a single workqueue for all devicesIlya Dryomov2014-10-301-15/+18
| | | | | | | | | Using one queue per device doesn't make much sense given that our workfn processes "devices" and not "requests". Switch to a single workqueue for all devices. Signed-off-by: Ilya Dryomov <idryomov@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com>
* rbd: rbd workqueues need a resque workerIlya Dryomov2014-10-141-1/+2
| | | | | | | | | | Need to use WQ_MEM_RECLAIM for our workqueues to prevent I/O lockups under memory pressure - we sit on the memory reclaim path. Cc: stable@vger.kernel.org # 3.17, needs backporting for 3.16 Signed-off-by: Ilya Dryomov <idryomov@redhat.com> Tested-by: Micha Krause <micha@krausam.de> Reviewed-by: Sage Weil <sage@redhat.com>
* rbd: set the remaining discard properties to enable supportJosh Durgin2014-10-141-0/+2
| | | | | | | max_discard_sectors must be set for the queue to support discard. Operations implementing discard for rbd zero data, so report that. Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
* rbd: use helpers to handle discard for layered images correctlyJosh Durgin2014-10-141-32/+22
| | | | | | | | | | | Only allocate two osd ops for discard requests, since the preallocation hint is only added for regular writes. Use rbd_img_obj_request_fill() to recreate the original write or discard osd operations, isolating that logic to one place, and change the assert in rbd_osd_req_create_copyup() to accept discard requests as well. Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
* rbd: extract a method for adding object operationsJosh Durgin2014-10-141-55/+78
| | | | | | | | | | | | | | | | rbd_img_request_fill() creates a ceph_osd_request and has logic for adding the appropriate osd ops to it based on the request type and image properties. For layered images, the original rbd_obj_request is resent with a copyup operation in front, using a new ceph_osd_request. The logic for adding the original operations should be the same as when first sending them, so move it to a helper function. op_type only needs to be checked once, so create a helper for that as well and call it outside the loop in rbd_img_request_fill(). Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
* rbd: make discard trigger copy-on-writeJosh Durgin2014-10-141-1/+2
| | | | | | | | Discard requests are a form of write, so they should go through the same process as plain write requests and trigger copy-on-write for layered images. Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
* rbd: tolerate -ENOENT for discard operationsJosh Durgin2014-10-141-0/+3
| | | | | | | | Discard may try to delete an object from a non-layered image that does not exist. If this occurs, the image already has no data in that range, so change the result to success. Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
* rbd: fix snapshot context reference count for discardsJosh Durgin2014-10-141-1/+2
| | | | | | | | Discards take a reference to the snapshot context of an image when they are created. This reference needs to be cleaned up when the request is done just as it is for regular writes. Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
* rbd: read image size for discard check safelyJosh Durgin2014-10-141-6/+12
| | | | | | | | | | In rbd_img_request_fill() the image size is only checked to determine whether we can truncate an object instead of zeroing it for discard requests. Take rbd_dev->header_rwsem while reading the image size, and move this read into the discard check, so that non-discard ops don't need to take the semaphore in this function. Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
* rbd: initial discard bits from Guangliang ZhaoGuangliang Zhao2014-10-141-15/+89
| | | | | | | | | | | | | | | | | | | | | | | | | | This patch add the discard support for rbd driver. There are three types operation in the driver: 1. The objects would be removed if they completely contained within the discard range. 2. The objects would be truncated if they partly contained within the discard range, and align with their boundary. 3. Others would be zeroed. A discard request from blkdev_issue_discard() is defined which REQ_WRITE and REQ_DISCARD both marked and no data, so we must check the REQ_DISCARD first when getting the request type. This resolve: http://tracker.ceph.com/issues/190 [ Ilya Dryomov: This is incomplete and somewhat buggy, see follow up commits by Josh Durgin for refinements and fixes which weren't folded in to preserve authorship. ] Signed-off-by: Guangliang Zhao <lucienchao@gmail.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>
* rbd: extend the operation typeGuangliang Zhao2014-10-141-34/+63
| | | | | | | | | It could only handle the read and write operations now, extend it for the coming discard support. Signed-off-by: Guangliang Zhao <lucienchao@gmail.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>
* rbd: skip the copyup when an entire object writingGuangliang Zhao2014-10-141-0/+8
| | | | | | | | | It need to copyup the parent's content when layered writing, but an entire object write would overwrite it, so skip it. Signed-off-by: Guangliang Zhao <lucienchao@gmail.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>
* rbd: add img_obj_request_simple() helperIlya Dryomov2014-10-141-16/+28
| | | | | | To clarify the conditions and make it easier to add new ones. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
* rbd: access snapshot context and mapping size safelyJosh Durgin2014-10-141-13/+21
| | | | | | | | | | | | | | | | | These fields may both change while the image is mapped if a snapshot is created or deleted or the image is resized. They are guarded by rbd_dev->header_rwsem, so hold that while reading them, and store a local copy to refer to outside of the critical section. The local copy will stay consistent since the snapshot context is reference counted, and the mapping size is just a u64. This prevents torn loads from giving us inconsistent values. Move reading header.snapc into the caller of rbd_img_request_create() so that we only need to take the semaphore once. The read-only caller, rbd_parent_request_create() can just pass NULL for snapc, since the snapshot context is only relevant for writes. Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
* rbd: do not return -ERANGE on auth failuresIlya Dryomov2014-10-141-3/+1
| | | | | | | | | | | Trying to map an image out of a pool for which we don't have an 'x' permission bit fails with -ERANGE from ceph_extract_encoded_string() due to an unsigned vs signed bug. Fix it and get rid of the -EINVAL sink, thus propagating rbd::get_id cls method errors. (I've seen a bunch of unexplained -ERANGE reports, I bet this is it). Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>
* rbd: fix error return code in rbd_dev_device_setup()Wei Yongjun2014-09-101-1/+3
| | | | | | | | Fix to return -ENOMEM from the workqueue alloc error handling case instead of 0, as done elsewhere in this function. Reviewed-by: Alex Elder <elder@linaro.org> Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
* rbd: avoid format-security warning inside alloc_workqueue()Ilya Dryomov2014-09-101-1/+1
| | | | | | | | drivers/block/rbd.c: In function ‘rbd_dev_device_setup’: drivers/block/rbd.c:5090:19: warning: format not a string literal and no format arguments [-Wformat-security] Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>