summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* lightnvm: export set bad block tableJavier González2016-11-295-50/+77
| | | | | | | | | | | | | | | Bad blocks should be managed by block owners. This would be either targets for data blocks or sysblk for system blocks. In order to support this, export two functions: One to mark a block as an specific type (e.g., bad block) and another to update the bad block table on the device. Move bad block management to rrpc. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>
* lightnvm: do not protect block 0Javier González2016-11-291-6/+0
| | | | | | | | | | | Device blocks should be marked by the device and considered as bad blocks by the media manager. Thus, do not make assumptions on which blocks are going to be used by the device. In doing so we might lose valid blocks from the free list. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>
* lightnvm: enable to send hint to erase commandJavier González2016-11-296-13/+15
| | | | | | | | | | Erases might be subject to host hints. An example is multi-plane programming to erase blocks in parallel. Enable targets to specify this hint. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>
* nvme: lightnvm: attach lightnvm sysfs to nvme block deviceMatias Bjørling2016-11-298-298/+197
| | | | | | | | | Previously, LBA read and write were not supported in the lightnvm specification. Now that it supports it, lets use the traditional NVMe gendisk, and attach the lightnvm sysfs geometry export. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>
* nvme: lightnvm: frees wrong cmd structureMatias Bjørling2016-11-291-1/+1
| | | | | | | | | | | When struct nvme_request was introduced, the nvme_nvm_submit_io was converted to the new interface. The interface moves nvme_nvm_command data structure into the struct request pdu. On io completion, rq->cmd is freed, which should have been the dereferenced pdu nvme_request->cmd. Fixes: d49187e97e94 "nvme: introduce struct nvme_request" Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq: Drop explicit timeout sync in hotplugGabriel Krisman Bertazi2016-11-291-8/+1
| | | | | | | | | | | After commit 287922eb0b18 ("block: defer timeouts to a workqueue"), deleting the timeout work after freezing the queue shouldn't be necessary, since the synchronization is already enforced by the acquisition of a q_usage_counter reference in blk_mq_timeout_work. Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com> Reviewed-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-wbt: allow wbt to be enabled always through sysfsJens Axboe2016-11-283-7/+29
| | | | | | | | | | | Currently there's no way to enable wbt if it's not enabled in the kernel config by default for a device. Allow a write to the 'wbt_lat_usec' queue sysfs file to enable wbt. This is useful for both the kernel config case, but also if the device is CFQ managed and it was turned off by default. Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-wbt: cleanup disable-by-default for CFQJens Axboe2016-11-283-10/+13
| | | | | | | | | Make it clear that we are disabling wbt for the specified queued, if it was enabled by default. This is in preparation for allowing users to re-enable wbt, and not have it disabled automatically again. Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-wbt: allow reset of default latency through sysfsJens Axboe2016-11-284-12/+37
| | | | | | | | Allow a write of '-1' to reset the default latency target for a given device. This removes knowledge of the different default settings for rotational vs non-rotational from user space. Signed-off-by: Jens Axboe <axboe@fb.com>
* nbd: fix setting of 'error' in NBD_DO_IT ioctlJens Axboe2016-11-231-2/+5
| | | | | | | Multiple paths don't set it properly, ensure that we do. Fixes: 9561a7ade0c2 ("nbd: add multi-connection support") Signed-off-by: Jens Axboe <axboe@fb.com>
* nbd: move multi-connection bit to unused valueJens Axboe2016-11-221-1/+1
| | | | | | | | Bit #7 is already used, move to bit #8 which is the first unused one. Fixes: 9561a7ade0c2 ("nbd: add multi-connection support") Signed-off-by: Jens Axboe <axboe@fb.com>
* nbd: add multi-connection supportJosef Bacik2016-11-222-148/+243
| | | | | | | | | | | NBD can become contended on its single connection. We have to serialize all writes and we can only process one read response at a time. Fix this by allowing userspace to provide multiple connections to a single nbd device. This coupled with block-mq drastically increases performance in multi-process cases. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* block,blkcg: use __GFP_NOWARN for best-effort allocations in blkcgTejun Heo2016-11-222-5/+7
| | | | | | | | | | | | | | | | blkcg allocates some per-cgroup data structures with GFP_NOWAIT and when that fails falls back to operations which aren't specific to the cgroup. Occassional failures are expected under pressure and falling back to non-cgroup operation is the right thing to do. Unfortunately, I forgot to add __GFP_NOWARN to these allocations and these expected failures end up creating a lot of noise. Add __GFP_NOWARN. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Marc MERLIN <marc@merlins.org> Reported-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
* fs: logfs: remove unnecesary checkMing Lei2016-11-221-1/+0
| | | | | | | | The check on bio->bi_vcnt doesn't make sense in erase_end_io(). Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* fs: logfs: use bio_add_page() in do_erase()Ming Lei2016-11-221-29/+15
| | | | | | | | Also code gets simplified a bit. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* fs: logfs: use bio_add_page() in __bdev_writeseg()Ming Lei2016-11-221-31/+20
| | | | | | | | Also this patch simplify the code a bit. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* fs: logfs: convert to bio_add_page() in sync_request()Ming Lei2016-11-221-5/+1
| | | | | | | | | Always bio_add_page() is the standard and preferred way to do the task. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* bcache: debug: avoid accessing .bi_io_vec directlyMing Lei2016-11-221-3/+8
| | | | | | | | Instead we use standard iterator way to do that. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* target: avoid accessing .bi_vcnt directlyMing Lei2016-11-221-6/+2
| | | | | | | | | | | | When the bio is full, bio_add_pc_page() will return zero, so use this information tell when the bio is full. Also replace access to .bi_vcnt for pr_debug() with bio_segments(). Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
* block: floppy: use bio_add_page()Ming Lei2016-11-221-5/+2
| | | | | | Signed-off-by: Ming Lei <tom.leiming@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* block: drbd: remove impossible failure handlingMing Lei2016-11-221-13/+1
| | | | | | | | | | | | | For a non-cloned bio, bio_add_page() only returns failure when the io vec table is full, but in that case, bio->bi_vcnt can't be zero at all. So remove the impossible failure handling. Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* block: bio: pass bvec table to bio_init()Ming Lei2016-11-2217-50/+28
| | | | | | | | | | | | | | | | | | Some drivers often use external bvec table, so introduce this helper for this case. It is always safe to access the bio->bi_io_vec in this way for this case. After converting to this usage, it will becomes a bit easier to evaluate the remaining direct access to bio->bi_io_vec, so it can help to prepare for the following multipage bvec support. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Fixed up the new O_DIRECT cases. Signed-off-by: Jens Axboe <axboe@fb.com>
* block_dev: get rid of blksize bits calculationJens Axboe2016-11-221-4/+4
| | | | | | | | | We store the bits in the bdev sector size locally, but we don't use the calculation anymore. All we do with it is shift it back up to the bdev sector size. So let's just use that directly and kill the variable and bits calculation. Signed-off-by: Jens Axboe <axboe@fb.com>
* block_dev: Fixed direct I/O bio sector calculationDamien Le Moal2016-11-221-2/+2
| | | | | | | | | | A direct I/O alignment must be always checked against the device blocks size, but the I/O offset (bio->bi_iter.bi_sector must always use 512B sector unit, and not the actual logical block size. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* block: apply blk_partition_remap to REQ_OP_ZONE_RESETShaun Tancheff2016-11-211-1/+6
| | | | | | | | | | | | | If a ZBC device is partitioned and operations are performed on the partition the zone information is rebased to the partition, however the zone reset is not mapped from the partition to device as are other operations. This causes the API (report zones / reset zone) to be unbalanced in this regard. Checking for the zone reset op code explicitly will balance the API. Signed-off-by: Shaun Tancheff <shaun.tancheff@seagate.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* block: clear all of bi_opf in bio_set_op_attrsChristoph Hellwig2016-11-211-2/+5
| | | | | | | | | | | | | Since commit 87374179 ("block: add a proper block layer data direction encoding") we only or the new op and flags into bi_opf in bio_set_op_attrs instead of clearing the old value. I've not seen any breakage with the new behavior, but it seems dangerous. Also convert it to an inline function to make the argument passing safer. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* pktcdvd: mark as unmaintained and deprecatedJens Axboe2016-11-212-3/+6
| | | | | | | | This driver is both orphaned, and not really useful anymore. Mark it as such, and remove it in a future kernel after a release or two. Signed-off-by: Jens Axboe <axboe@fb.com>
* block: Change extern inline to static inlineTobias Klauser2016-11-181-1/+1
| | | | | | | | | | | | | | | | | | | | | | With compilers which follow the C99 standard (like modern versions of gcc and clang), "extern inline" does the opposite thing from older versions of gcc (emits code for an externally linkable version of the inline function). "static inline" does the intended behavior in all cases instead. Description taken from commit 6d91857d4826 ("staging, rtl8192e, LLVMLinux: Change extern inline to static inline"). This also fixes the following GCC warning when building with CONFIG_PM disabled: ./include/linux/blkdev.h:1143:20: warning: no previous prototype for 'blk_set_runtime_active' [-Wmissing-prototypes] Fixes: d07ab6d11477 ("block: Add blk_set_runtime_active()") Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com> Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: Jens Axboe <axboe@fb.com>
* skd_main: drop duplicate header scatterlist.hGeliang Tang2016-11-181-1/+0
| | | | | | | Drop duplicate header scatterlist.h from skd_main.c. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* block: document the 'io_poll_delay' queue sysfs fileJens Axboe2016-11-181-0/+14
| | | | | | | This was documented in the original commit, 64f1c21e86f7, but it never made it into the proper location for queue sysfs files. Signed-off-by: Jens Axboe <axboe@fb.com>
* block: new direct I/O implementationChristoph Hellwig2016-11-171-4/+162
| | | | | | | | | | | | | Similar to the simple fast path, but we now need a dio structure to track multiple-bio completions. It's basically a cut-down version of the new iomap-based direct I/O code for filesystems, but without all the logic to call into the filesystem for extent lookup or allocation, and without the complex I/O completion workqueue handler for AIO - instead we just use the FUA bit on the bios to ensure data is flushed to stable storage. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* block: make __blkdev_direct_IO_sync() support O_SYNC/DSYNCJens Axboe2016-11-171-2/+12
| | | | | | Split the op setting code into a helper, use it in both places. Signed-off-by: Jens Axboe <axboe@fb.com>
* block: support a full bio worth of IO for simplified bdev direct-ioJens Axboe2016-11-171-4/+15
| | | | | | Just alloc the bio_vec array if we exceed the inline limit. Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq: make the polling code adaptiveJens Axboe2016-11-173-12/+83
| | | | | | | | | | | | | | | | | | | The previous commit introduced the hybrid sleep/poll mode. Take that one step further, and use the completion latencies to automatically sleep for half the mean completion time. This is a good approximation. This changes the 'io_poll_delay' sysfs file a bit to expose the various options. Depending on the value, the polling code will behave differently: -1 Never enter hybrid sleep mode 0 Use half of the completion mean for the sleep delay >0 Use this specific value as the sleep delay Signed-off-by: Jens Axboe <axboe@fb.com> Tested-By: Stephen Bates <sbates@raithlin.com> Reviewed-By: Stephen Bates <sbates@raithlin.com>
* blk-mq: implement hybrid poll mode for sync O_DIRECTJens Axboe2016-11-174-0/+81
| | | | | | | | | | | | | | | This patch enables a hybrid polling mode. Instead of polling after IO submission, we can induce an artificial delay, and then poll after that. For example, if the IO is presumed to complete in 8 usecs from now, we can sleep for 4 usecs, wake up, and then do our polling. This still puts a sleep/wakeup cycle in the IO path, but instead of the wakeup happening after the IO has completed, it'll happen before. With this hybrid scheme, we can achieve big latency reductions while still using the same (or less) amount of CPU. Signed-off-by: Jens Axboe <axboe@fb.com> Tested-By: Stephen Bates <sbates@raithlin.com> Reviewed-By: Stephen Bates <sbates@raithlin.com>
* block: fast-path for small and simple direct I/O requestsChristoph Hellwig2016-11-171-0/+80
| | | | | | | | | | | | | | | This patch adds a small and simple fast patch for small direct I/O requests on block devices that don't use AIO. Between the neat bio_iov_iter_get_pages helper that avoids allocating a page array for get_user_pages and the on-stack bio and biovec this avoid memory allocations and atomic operations entirely in the direct I/O code (lower levels might still do memory allocations and will usually have at least some atomic operations, though). Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com> Tested-By: Stephen Bates <sbates@raithlin.com> Reviewed-By: Stephen Bates <sbates@raithlin.com>
* nbd: fix use-after-free of rq/bio in the xmit pathJens Axboe2016-11-171-9/+23
| | | | | | | | | | | | For writes, we can get a completion in while we're still iterating the request and bio chain. If that happens, we're reading freed memory and we can crash. Break out after the last segment and avoid having the iterator read freed memory. Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-wbt: fix old-style function declarationArnd Bergmann2016-11-161-1/+1
| | | | | | | | | | | | The newly added driver causes a harmless warning in some configurations: block/blk-wbt.c:250:1: error: ‘inline’ is not at beginning of declaration [-Werror=old-style-declaration] static bool inline stat_sample_valid(struct blk_rq_stat *stat) This makes it use the expected format for the declaration. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* null_blk: add usage hints for NVMYasuaki Ishimatsu2016-11-162-1/+2
| | | | | | | | | | | | | If CONFIG_NVM is disabled, loading null_block module with use_lightnvm=1 fails. But there are no messages and documents related to the failure. Add the appropriate error message. Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Massaged the text a bit. Signed-off-by: Jens Axboe <axboe@fb.com>
* block: deal with stale req count of plug listMing Lei2016-11-162-1/+11
| | | | | | | | | | | | | | | | | In both legacy and mq path, req count of plug list is computed before allocating request, so the number can be stale when falling back to slept allocation, also the new introduced wbt can sleep too. This patch deals with the case by checking if plug list becomes empty, and fixes the KASAN report of 'BUG: KASAN: stack-out-of-bounds' which is introduced by Shaohua's patches of dispatching big request. Fixes: 600271d900002(blk-mq: immediately dispatch big size request) Fixes: 50d24c34403c6(block: immediately dispatch big size request) Cc: Shaohua Li <shli@fb.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* scsi_lib: untangle 0 and BLK_MQ_RQ_QUEUE_OKOmar Sandoval2016-11-151-3/+3
| | | | | | | | Let's not depend on any of the BLK_MQ_RQ_QUEUE_* constants having specific values. No functional change. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* nvme: untangle 0 and BLK_MQ_RQ_QUEUE_OKOmar Sandoval2016-11-154-10/+10
| | | | | | | | | Let's not depend on any of the BLK_MQ_RQ_QUEUE_* constants having specific values. No functional change. Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* loop: return proper error from loop_queue_rq()Omar Sandoval2016-11-141-1/+1
| | | | | | | | | | ->queue_rq() should return one of the BLK_MQ_RQ_QUEUE_* constants, not an errno. f4aa4c7bbac6 ("block: loop: convert to per-device workqueue") Signed-off-by: Omar Sandoval <osandov@fb.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>
* sd_zbc: Force use of READ16/WRITE16Damien Le Moal2016-11-141-0/+4
| | | | | | | | | | | | | | | | | Normally, sd_read_capacity sets sdp->use_16_for_rw to 1 based on the disk capacity so that READ16/WRITE16 are used for large drives. However, for a zoned disk with RC_BASIS set to 0, the capacity reported through READ_CAPACITY may be very small, leading to use_16_for_rw not being set and READ10/WRITE10 commands being used, even after the actual zoned disk capacity is corrected in sd_zbc_read_zones. This causes LBA offset overflow for accesses beyond 2TB. As the ZBC standard makes it mandatory for ZBC drives to support the READ16/WRITE16 commands anyway, make sure that use_16_for_rw is set. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> eviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* bsg: Add sparse annotations to bsg_request_fn()Bart Van Assche2016-11-141-0/+2
| | | | | | | Avoid that sparse complains about unbalanced lock actions. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-wbt: use BLK_STAT_{READ,WRITE} instead of 0/1Jens Axboe2016-11-121-6/+6
| | | | | | Since we have proper enums for the stats directions, use them. Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-wbt: remove stat opsJens Axboe2016-11-123-43/+8
| | | | | | | Again a leftover from when the throttling code was generic. Now that we just have the block user, get rid of the stat ops and indirections. Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-wbt: store queue instead of bdiJens Axboe2016-11-122-11/+13
| | | | | | | The bdi was a leftover from when the code was block layer agnostic. Now that we just support a block layer user, store the queue directly. Signed-off-by: Jens Axboe <axboe@fb.com>
* block: move poll code to blk-mqJens Axboe2016-11-115-49/+57
| | | | | | | | The poll code is blk-mq specific, let's move it to blk-mq.c. This is a prep patch for improving the polling code. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
* Block: mtip32xx: Improvement in code readability when memdup_user() fails.Sachin Shukla2016-11-111-9/+5
| | | | | | | | There is no need to call kfree() if memdup_user() fails, as no memory was allocated and the error in the error-valued pointer should be returned. Signed-off-by: Sachin Shukla <sachin.s5@samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>