summaryrefslogtreecommitdiffstats
path: root/drivers/nvme/host/rdma.c (follow)
Commit message (Collapse)AuthorAgeFilesLines
* nvme: if_ready checks to fail io to deleting controllerJames Smart2018-07-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The revised if_ready checks skipped over the case of returning error when the controller is being deleted. Instead it was returning BUSY, which caused the ios to retry, which caused the ns delete to hang waiting for the ios to drain. Stack trace of hang looks like: kworker/u64:2 D 0 74 2 0x80000000 Workqueue: nvme-delete-wq nvme_delete_ctrl_work [nvme_core] Call Trace: ? __schedule+0x26d/0x820 schedule+0x32/0x80 blk_mq_freeze_queue_wait+0x36/0x80 ? remove_wait_queue+0x60/0x60 blk_cleanup_queue+0x72/0x160 nvme_ns_remove+0x106/0x140 [nvme_core] nvme_remove_namespaces+0x7e/0xa0 [nvme_core] nvme_delete_ctrl_work+0x4d/0x80 [nvme_core] process_one_work+0x160/0x350 worker_thread+0x1c3/0x3d0 kthread+0xf5/0x130 ? process_one_work+0x350/0x350 ? kthread_bind+0x10/0x10 ret_from_fork+0x1f/0x30 Extend nvmf_fail_nonready_command() to supply the controller pointer so that the controller state can be looked at. Fail any io to a controller that is deleting. Fixes: 3bc32bb1186c ("nvme-fabrics: refactor queue ready check") Fixes: 35897b920c8a ("nvme-fabrics: fix and refine state checks in __nvmf_check_ready") Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Ewan D. Milne <emilne@redhat.com> Reviewed-by: Ewan D. Milne <emilne@redhat.com>
* nvme-rdma: fix possible double free of controller async event bufferSagi Grimberg2018-06-281-2/+5
| | | | | | | | | | | | | | If reconnect/reset failed where the controller async event buffer was freed, we might end up freeing it again as we call nvme_rdma_destroy_admin_queue again in the remove path. Given that the sequence is guaranteed to serialize by .ctrl_stop, we simply set ctrl->async_event_sqe.data to NULL and don't free it in future visits. Reported-by: Max Gurtovoy <maxg@mellanox.com> Tested-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
* nvme-rdma: don't override opts->queue_sizeSagi Grimberg2018-06-201-11/+10
| | | | | | | | | | That is user argument, and theoretically controller limits can change over time (over reconnects/resets). Instead, use the sqsize controller attribute to check queue depth boundaries and use it to the tagset allocation. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
* nvme-rdma: Fix command completion race at error recoveryIsrael Rukshin2018-06-201-2/+6
| | | | | | | | | | | | | | | | | | | | | | The race is between completing the request at error recovery work and rdma completions. If we cancel the request before getting the good rdma completion we get a NULL deref of the request MR at nvme_rdma_process_nvme_rsp(). When Canceling the request we return its mr to the mr pool (set mr to NULL) and also unmap its data. Canceling the requests while the rdma queues are active is not safe. Because rdma queues are active and we get good rdma completions that can use the mr pointer which may be NULL. Completing the request too soon may lead also to performing DMA to/from user buffers which might have been already unmapped. The commit fixes the race by draining the QP before starting the abort commands mechanism. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
* nvme-rdma: fix possible free of a non-allocated async event bufferSagi Grimberg2018-06-201-13/+11
| | | | | | | | | | | | | If nvme_rdma_configure_admin_queue fails before we allocated the async event buffer, we will falsly free it because nvme_rdma_free_queue is freeing it. Fix it by allocating the buffer right after nvme_rdma_alloc_queue and free it right before nvme_rdma_queue_free to maintain orderly reverse cleanup sequence. Reported-by: Israel Rukshin <israelr@mellanox.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
* nvme-rdma: fix possible double free condition when failing to create a ↵Sagi Grimberg2018-06-201-10/+10
| | | | | | | | | | | | | | | controller Failures after nvme_init_ctrl will defer resource cleanups to .free_ctrl when the reference is released, hence we should not free the controller queues for these failures. Fix that by moving controller queues allocation before controller initialization and correctly freeing them for failures before initialization and skip them for failures after initialization. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
* nvme-fabrics: refactor queue ready checkChristoph Hellwig2018-06-151-4/+3
| | | | | | | | | | | | | Move the is_connected check to the fibre channel transport, as it has no meaning for other transports. To facilitate this split out a new nvmf_fail_nonready_command helper that is called by the transport when it is asked to handle a command on a queue that is not ready. Also avoid a function call for the queue live fast path by inlining the check. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Smart <james.smart@broadcom.com>
* nvme-rdma: fix error flow during mapping request dataMax Gurtovoy2018-06-111-7/+24
| | | | | | | | After dma mapping the sgl, we map the sgl to nvme sgl descriptor. In case of failure during the last mapping we never dma unmap the sgl. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
* nvme-rdma: correctly check for target keyed sgl supportSteve Wise2018-06-081-2/+3
| | | | | | | | | The code was checking bit 20 instead of bit 2. Also fixed the log entry. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Merge branch 'nvme-4.18-2' of git://git.infradead.org/nvme into for-4.18/blockJens Axboe2018-05-291-4/+6
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull NVMe changes from Christoph: "Here is the current batch of nvme updates for 4.18, we have a few more patches in the queue, but I'd like to get this pile into your tree and linux-next ASAP. The biggest item is support for file-backed namespaces in the NVMe target from Chaitanya, in addition to that we mostly small fixes from all the usual suspects." * 'nvme-4.18-2' of git://git.infradead.org/nvme: nvme: fixup memory leak in nvme_init_identify() nvme: fix KASAN warning when parsing host nqn nvmet-loop: use nr_phys_segments when map rq to sgl nvmet-fc: increase LS buffer count per fc port nvmet: add simple file backed ns support nvmet: remove duplicate NULL initialization for req->ns nvmet: make a few error messages more generic nvme-fabrics: allow duplicate connections to the discovery controller nvme-fabrics: centralize discovery controller defaults nvme-fabrics: remove unnecessary controller subnqn validation nvme-fc: remove setting DNR on exception conditions nvme-rdma: stop admin queue before freeing it nvme-pci: Fix AER reset handling nvme-pci: set nvmeq->cq_vector after alloc cq/sq nvme: host: core: fix precedence of ternary operator nvme: fix lockdep warning in nvme_mpath_clear_current_path
| * nvme-rdma: stop admin queue before freeing itJianchao Wang2018-05-251-4/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For any failure after nvme_rdma_start_queue in nvme_rdma_configure_admin_queue, the admin queue will be freed with the NVME_RDMA_Q_LIVE flag still set. Once nvme_rdma_stop_queue is invoked, that will cause a use-after-free. BUG: KASAN: use-after-free in rdma_disconnect+0x1f/0xe0 [rdma_cm] To fix it, call nvme_rdma_stop_queue for all the failed cases after nvme_rdma_start_queue. Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Suggested-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
* | nvme: return BLK_EH_DONE from ->timeoutChristoph Hellwig2018-05-291-1/+1
|/ | | | | | | | | | | NVMe always completes the request before returning from ->timeout, either by polling for it, or by disabling the controller. Return BLK_EH_DONE so that the block layer doesn't even try to complete it again. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nvme: expand nvmf_check_if_ready checksJames Smart2018-04-121-12/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The nvmf_check_if_ready() checks that were added are very simplistic. As such, the routine allows a lot of cases to fail ios during windows of reset or re-connection. In cases where there are not multi-path options present, the error goes back to the callee - the filesystem or application. Not good. The common routine was rewritten and calling syntax slightly expanded so that per-transport is_ready routines don't need to be present. The transports now call the routine directly. The routine is now a fabrics routine rather than an inline function. The routine now looks at controller state to decide the action to take. Some states mandate io failure. Others define the condition where a command can be accepted. When the decision is unclear, a generic queue-or-reject check is made to look for failfast or multipath ios and only fails the io if it is so marked. Otherwise, the io will be queued and wait for the controller state to resolve. Admin commands issued via ioctl share a live admin queue with commands from the transport for controller init. The ioctls could be intermixed with the initialization commands. It's possible for the ioctl cmd to be issued prior to the controller being enabled. To block this, the ioctl admin commands need to be distinguished from admin commands used for controller init. Added a USERCMD nvme_req(req)->rq_flags bit to reflect this division and set it on ioctls requests. As the nvmf_check_if_ready() routine is called prior to nvme_setup_cmd(), ensure that commands allocated by the ioctl path (actually anything in core.c) preps the nvme_req(req) before starting the io. This will preserve the USERCMD flag during execution and/or retry. Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.e> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nvme: Add .stop_ctrl to nvme ctrl opsNitzan Carmi2018-03-261-3/+9
| | | | | | | | | | | | | | | | | For consistancy reasons, any fabric-specific works (e.g error recovery/reconnect) should be canceled in nvme_stop_ctrl, as for all other NVMe pending works (e.g. scan, keep alive). The patch aims to simplify the logic of the code, as we now only rely on a vague demand from any fabric to flush its private workqueues at the beginning of .delete_ctrl op. Signed-off-by: Nitzan Carmi <nitzanc@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nvme-rdma: Allow DELETING state change failure in error_recoveryNitzan Carmi2018-03-261-2/+2
| | | | | | | | | | | | While error recovery is ongoing, it is OK to move ctrl to DELETING state (from concurrent delete_work). Thus we don't need a warning for that case. Signed-off-by: Nitzan Carmi <nitzanc@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nvme-rdma: Don't flush delete_wq by default during remove_oneMax Gurtovoy2018-03-261-0/+14
| | | | | | | | | | | | The .remove_one function is called for any ib_device removal. In case the removed device has no reference in our driver, there is no need to flush the work queue. Reviewed-by: Israel Rukshin <israelr@mellanox.com> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nvme: centralize ctrl removal printsMax Gurtovoy2018-03-261-4/+0
| | | | | | | | | | | | nvme_delete_ctrl can be called from various contexts in parallel, and cause duplicated information prints, even though the specific context doesn't perform the actual removal. Instead, print the information when the actual removal occurs. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Merge branch 'for-jens' of git://git.infradead.org/nvme into for-linusJens Axboe2018-02-281-2/+2
|\ | | | | | | | | | | | | | | | | | | | | | | Pull NVMe fixes from Keith for 4.16-rc. * 'for-jens' of git://git.infradead.org/nvme: nvmet: fix PSDT field check in command format nvme-multipath: fix sysfs dangerously created links nvme-pci: Fix nvme queue cleanup if IRQ setup fails nvmet-loop: use blk_rq_payload_bytes for sgl selection nvme-rdma: use blk_rq_payload_bytes instead of blk_rq_bytes nvme-fabrics: don't check for non-NULL module in nvmf_register_transport
| * nvme-rdma: use blk_rq_payload_bytes instead of blk_rq_bytesChristoph Hellwig2018-02-221-2/+2
| | | | | | | | | | | | | | | | | | | | blk_rq_bytes does the wrong thing for special payloads like discards and might cause the driver to not set up a SGL. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Keith Busch <keith.busch@intel.com>
* | nvme-rdma: fix sysfs invoked reset_ctrl error flowNitzan Carmi2018-02-141-5/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When reset_controller that is invoked by sysfs fails, it enters an error flow which practically removes the nvme ctrl entirely (similar to delete_ctrl flow). It causes the system to hang, since a sysfs attribute cannot be unregistered by one of its own methods. This can be fixed by calling delete_ctrl as a work rather than sequential code. In addition, it should give the ctrl a chance to recover using reconnection mechanism (consistant with FC reset_ctrl error flow). Also, while we're here, return suitable errno in case the reset ended with non live ctrl. Signed-off-by: Nitzan Carmi <nitzanc@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
* | nvme-rdma: use NVME_CTRL_CONNECTING state to mark init processMax Gurtovoy2018-02-081-0/+3
| | | | | | | | | | | | | | | | | | | | | | In order to avoid concurrent error recovery during initialization process (allowed by the NVME_CTRL_NEW --> NVME_CTRL_RESETTING transition) we must mark the ctrl as CONNECTING before initial connection establisment. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
* | nvme: rename NVME_CTRL_RECONNECTING state to NVME_CTRL_CONNECTINGMax Gurtovoy2018-02-081-3/+3
|/ | | | | | | | | In pci transport, this state is used to mark the initialization process. This should be also used in other transports as well. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
* Merge branch 'for-4.16/block' of git://git.kernel.dk/linux-blockLinus Torvalds2018-01-291-4/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block updates from Jens Axboe: "This is the main pull request for block IO related changes for the 4.16 kernel. Nothing major in this pull request, but a good amount of improvements and fixes all over the map. This contains: - BFQ improvements, fixes, and cleanups from Angelo, Chiara, and Paolo. - Support for SMR zones for deadline and mq-deadline from Damien and Christoph. - Set of fixes for bcache by way of Michael Lyle, including fixes from himself, Kent, Rui, Tang, and Coly. - Series from Matias for lightnvm with fixes from Hans Holmberg, Javier, and Matias. Mostly centered around pblk, and the removing rrpc 1.2 in preparation for supporting 2.0. - A couple of NVMe pull requests from Christoph. Nothing major in here, just fixes and cleanups, and support for command tracing from Johannes. - Support for blk-throttle for tracking reads and writes separately. From Joseph Qi. A few cleanups/fixes also for blk-throttle from Weiping. - Series from Mike Snitzer that enables dm to register its queue more logically, something that's alwways been problematic on dm since it's a stacked device. - Series from Ming cleaning up some of the bio accessor use, in preparation for supporting multipage bvecs. - Various fixes from Ming closing up holes around queue mapping and quiescing. - BSD partition fix from Richard Narron, fixing a problem where we can't mount newer (10/11) FreeBSD partitions. - Series from Tejun reworking blk-mq timeout handling. The previous scheme relied on atomic bits, but it had races where we would think a request had timed out if it to reused at the wrong time. - null_blk now supports faking timeouts, to enable us to better exercise and test that functionality separately. From me. - Kill the separate atomic poll bit in the request struct. After this, we don't use the atomic bits on blk-mq anymore at all. From me. - sgl_alloc/free helpers from Bart. - Heavily contended tag case scalability improvement from me. - Various little fixes and cleanups from Arnd, Bart, Corentin, Douglas, Eryu, Goldwyn, and myself" * 'for-4.16/block' of git://git.kernel.dk/linux-block: (186 commits) block: remove smart1,2.h nvme: add tracepoint for nvme_complete_rq nvme: add tracepoint for nvme_setup_cmd nvme-pci: introduce RECONNECTING state to mark initializing procedure nvme-rdma: remove redundant boolean for inline_data nvme: don't free uuid pointer before printing it nvme-pci: Suspend queues after deleting them bsg: use pr_debug instead of hand crafted macros blk-mq-debugfs: don't allow write on attributes with seq_operations set nvme-pci: Fix queue double allocations block: Set BIO_TRACE_COMPLETION on new bio during split blk-throttle: use queue_is_rq_based block: Remove kblockd_schedule_delayed_work{,_on}() blk-mq: Avoid that blk_mq_delay_run_hw_queue() introduces unintended delays blk-mq: Rename blk_mq_request_direct_issue() into blk_mq_request_issue_directly() lib/scatterlist: Fix chaining support in sgl_alloc_order() blk-throttle: track read and write request individually block: add bdev_read_only() checks to common helpers block: fail op_is_write() requests to read-only partitions blk-throttle: export io_serviced_recursive, io_service_bytes_recursive ...
| * nvme-rdma: remove redundant boolean for inline_dataMax Gurtovoy2018-01-251-3/+0
| | | | | | | | | | | | | | Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
| * nvme: host delete_work and reset_work on separate workqueuesRoy Shterman2018-01-151-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We need to ensure that delete_work will be hosted on a different workqueue than all the works we flush or cancel from it. Otherwise we may hit a circular dependency warning [1]. Also, given that delete_work flushes reset_work, host reset_work on nvme_reset_wq and delete_work on nvme_delete_wq. In addition, fix the flushing in the individual drivers to flush nvme_delete_wq when draining queued deletes. [1]: [ 178.491942] ============================================= [ 178.492718] [ INFO: possible recursive locking detected ] [ 178.493495] 4.9.0-rc4-c844263313a8-lb #3 Tainted: G OE [ 178.494382] --------------------------------------------- [ 178.495160] kworker/5:1/135 is trying to acquire lock: [ 178.495894] ( [ 178.496120] "nvme-wq" [ 178.496471] ){++++.+} [ 178.496599] , at: [ 178.496921] [<ffffffffa70ac206>] flush_work+0x1a6/0x2d0 [ 178.497670] but task is already holding lock: [ 178.498499] ( [ 178.498724] "nvme-wq" [ 178.499074] ){++++.+} [ 178.499202] , at: [ 178.499520] [<ffffffffa70ad6c2>] process_one_work+0x162/0x6a0 [ 178.500343] other info that might help us debug this: [ 178.501269] Possible unsafe locking scenario: [ 178.502113] CPU0 [ 178.502472] ---- [ 178.502829] lock( [ 178.503115] "nvme-wq" [ 178.503467] ); [ 178.503716] lock( [ 178.504001] "nvme-wq" [ 178.504353] ); [ 178.504601] *** DEADLOCK *** [ 178.505441] May be due to missing lock nesting notation [ 178.506453] 2 locks held by kworker/5:1/135: [ 178.507068] #0: [ 178.507330] ( [ 178.507598] "nvme-wq" [ 178.507726] ){++++.+} [ 178.508079] , at: [ 178.508173] [<ffffffffa70ad6c2>] process_one_work+0x162/0x6a0 [ 178.509004] #1: [ 178.509265] ( [ 178.509532] (&ctrl->delete_work) [ 178.509795] ){+.+.+.} [ 178.510145] , at: [ 178.510239] [<ffffffffa70ad6c2>] process_one_work+0x162/0x6a0 [ 178.511070] stack backtrace: : [ 178.511693] CPU: 5 PID: 135 Comm: kworker/5:1 Tainted: G OE 4.9.0-rc4-c844263313a8-lb #3 [ 178.512974] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014 [ 178.514247] Workqueue: nvme-wq nvme_del_ctrl_work [nvme_tcp] [ 178.515071] ffffc2668175bae0 ffffffffa7450823 ffffffffa88abd80 ffffffffa88abd80 [ 178.516195] ffffc2668175bb98 ffffffffa70eb012 ffffffffa8d8d90d ffff9c472e9ea700 [ 178.517318] ffff9c472e9ea700 ffff9c4700000000 ffff9c4700007200 ab83be61bec0d50e [ 178.518443] Call Trace: [ 178.518807] [<ffffffffa7450823>] dump_stack+0x85/0xc2 [ 178.519542] [<ffffffffa70eb012>] __lock_acquire+0x17d2/0x18f0 [ 178.520377] [<ffffffffa75839a7>] ? serial8250_console_putchar+0x27/0x30 [ 178.521330] [<ffffffffa7583980>] ? wait_for_xmitr+0xa0/0xa0 [ 178.522174] [<ffffffffa70ac1eb>] ? flush_work+0x18b/0x2d0 [ 178.522975] [<ffffffffa70eb7cb>] lock_acquire+0x11b/0x220 [ 178.523753] [<ffffffffa70ac206>] ? flush_work+0x1a6/0x2d0 [ 178.524535] [<ffffffffa70ac229>] flush_work+0x1c9/0x2d0 [ 178.525291] [<ffffffffa70ac206>] ? flush_work+0x1a6/0x2d0 [ 178.526077] [<ffffffffa70a9cf0>] ? flush_workqueue_prep_pwqs+0x220/0x220 [ 178.527040] [<ffffffffa70ae7cf>] __cancel_work_timer+0x10f/0x1d0 [ 178.527907] [<ffffffffa70fecb9>] ? vprintk_default+0x29/0x40 [ 178.528726] [<ffffffffa71cb507>] ? printk+0x48/0x50 [ 178.529434] [<ffffffffa70ae8c3>] cancel_delayed_work_sync+0x13/0x20 [ 178.530381] [<ffffffffc042100b>] nvme_stop_ctrl+0x5b/0x70 [nvme_core] [ 178.531314] [<ffffffffc0403dcc>] nvme_del_ctrl_work+0x2c/0x50 [nvme_tcp] [ 178.532271] [<ffffffffa70ad741>] process_one_work+0x1e1/0x6a0 [ 178.533101] [<ffffffffa70ad6c2>] ? process_one_work+0x162/0x6a0 [ 178.533954] [<ffffffffa70adc4e>] worker_thread+0x4e/0x490 [ 178.534735] [<ffffffffa70adc00>] ? process_one_work+0x6a0/0x6a0 [ 178.535588] [<ffffffffa70adc00>] ? process_one_work+0x6a0/0x6a0 [ 178.536441] [<ffffffffa70b48cf>] kthread+0xff/0x120 [ 178.537149] [<ffffffffa70b47d0>] ? kthread_park+0x60/0x60 [ 178.538094] [<ffffffffa70b47d0>] ? kthread_park+0x60/0x60 [ 178.538900] [<ffffffffa78e332a>] ret_from_fork+0x2a/0x40 Signed-off-by: Roy Shterman <roys@lightbitslabs.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
| * nvme-fabrics: protect against module unload during create_ctrlRoy Shterman2018-01-081-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | NVMe transport driver module unload may (and usually does) trigger iteration over the active controllers and delete them all (sometimes under a mutex). However, a controller can be created concurrently with module unload which can lead to leakage of resources (most important char device node leakage) in case the controller creation occured after the unload delete and drain sequence. To protect against this, we take a module reference to guarantee that the nvme transport driver is not unloaded while creating a controller. Signed-off-by: Roy Shterman <roys@lightbitslabs.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
* | nvme-rdma: fix concurrent reset and reconnectSagi Grimberg2017-12-291-1/+13
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | Now ctrl state machine allows to transition from RESETTING to RECONNECTING. In nvme-rdma when we receive a rdma cm DISONNECTED event, we trigger nvme_rdma_error_recovery. This happens also when we execute a controller reset, issue a cm diconnect request and receive a cm disconnect reply, as a result, the reset work and the error recovery work can run concurrently. Until now the state machine prevented from the error recovery work from running as a result of a controller reset (RESETTING -> RECONNECTING was not allowed). To fix this, we adopt the FC state machine approach, we always transition from LIVE to RESETTING and only then to RECONNECTING. We do this both for the error recovery work and the controller reset work: 1. transition to RESETTING 2. teardown the controller association 3. transition to RECONNECTING This will restore the protection against reset work and error recovery work from concurrently running together. Fixes: 3cec7f9de448 ("nvme: allow controller RESETTING to RECONNECTING transition") Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
* nvme-rdma: fix memory leak during queue allocationMax Gurtovoy2017-11-281-3/+18
| | | | | | | | | | | | | | | In case nvme_rdma_wait_for_cm timeout expires before we get an established or rejected event (rdma_connect succeeded) from rdma_cm, we end up with leaking the ib transport resources for dedicated queue. This scenario can easily reproduced using traffic test during port toggling. Also, in order to protect from parallel ib queue destruction, that may be invoked from different context's, introduce new flag that stands for transport readiness. While we're here, protect also against a situation that we can receive rdma_cm events during ib queue destruction. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
* nvme-rdma: Use mr poolIsrael Rukshin2017-11-261-58/+37
| | | | | | | | | | | | | | | | | | | | | Currently, blk_mq_tagset_iter() iterate over initial hctx tags only. If an I/O scheduler is used, it doesn't iterate the hctx scheduler tags and the static request aren't been updated. For example, while using NVMe over Fabrics RDMA host, this cause us not to reinit the scheduler requests and thus not re-register all the memory regions during the tagset re-initialization in the reconnect flow. This may lead to a memory registration error: "MEMREG for CQE 0xffff88044c14dce8 failed with status memory management operation error (6)" With this commit we don't need to reinit the requests, and thus fix this failure. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
* nvme-rdma: Check remotely invalidated rkey matches our expected rkeySagi Grimberg2017-11-261-2/+7
| | | | | | | | | If we got a remote invalidation on a bogus rkey, this is a protocol error. Fail the connection in this case. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
* nvme-rdma: wait for local invalidation before completing a requestSagi Grimberg2017-11-261-15/+24
| | | | | | | | | | | | We must not complete a request before the host memory region is invalidated. Luckily we have send with invalidate protocol support so we usually don't need to execute it, but in case the target did not invalidate a memory region for us, we must wait for the invalidation to complete before unmapping host memory and completing the I/O. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
* nvme-rdma: don't complete requests before a send work request has completedSagi Grimberg2017-11-261-4/+24
| | | | | | | | | | | | | | | | | In order to guarantee that the HCA will never get an access violation (either from invalidated rkey or from iommu) when retrying a send operation we must complete a request only when both send completion and the nvme cqe has arrived. We need to set the send/recv completions flags atomically because we might have more than a single context accessing the request concurrently (one is cq irq-poll context and the other is user-polling used in IOCB_HIPRI). Only then we are safe to invalidate the rkey (if needed), unmap the host buffers, and complete the IO. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
* nvme-rdma: don't suppress send completionsSagi Grimberg2017-11-261-40/+14
| | | | | | | | | | | | | | The entire completions suppress mechanism is currently broken because the HCA might retry a send operation (due to dropped ack) after the nvme transaction has completed. In order to handle this, we signal all send completions and introduce a separate done handler for async events as they will be handled differently (as they don't include in-capsule data by definition). Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
* nvme-fabrics: introduce init command check for a queue that is not aliveSagi Grimberg2017-11-201-26/+6
| | | | | | | | | | | | | When the fabrics queue is not alive and fully functional, no commands should be allowed to pass but connect (which moves the queue to a fully functional state). Any other command should be failed, with either temporary status BLK_STS_RESOUCE or permanent status BLK_STS_IOERR. This is shared across all fabrics, hence move the check to fabrics library. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
* Merge branch 'for-4.15/block' of git://git.kernel.dk/linux-blockLinus Torvalds2017-11-151-105/+141
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull core block layer updates from Jens Axboe: "This is the main pull request for block storage for 4.15-rc1. Nothing out of the ordinary in here, and no API changes or anything like that. Just various new features for drivers, core changes, etc. In particular, this pull request contains: - A patch series from Bart, closing the whole on blk/scsi-mq queue quescing. - A series from Christoph, building towards hidden gendisks (for multipath) and ability to move bio chains around. - NVMe - Support for native multipath for NVMe (Christoph). - Userspace notifications for AENs (Keith). - Command side-effects support (Keith). - SGL support (Chaitanya Kulkarni) - FC fixes and improvements (James Smart) - Lots of fixes and tweaks (Various) - bcache - New maintainer (Michael Lyle) - Writeback control improvements (Michael) - Various fixes (Coly, Elena, Eric, Liang, et al) - lightnvm updates, mostly centered around the pblk interface (Javier, Hans, and Rakesh). - Removal of unused bio/bvec kmap atomic interfaces (me, Christoph) - Writeback series that fix the much discussed hundreds of millions of sync-all units. This goes all the way, as discussed previously (me). - Fix for missing wakeup on writeback timer adjustments (Yafang Shao). - Fix laptop mode on blk-mq (me). - {mq,name} tupple lookup for IO schedulers, allowing us to have alias names. This means you can use 'deadline' on both !mq and on mq (where it's called mq-deadline). (me). - blktrace race fix, oopsing on sg load (me). - blk-mq optimizations (me). - Obscure waitqueue race fix for kyber (Omar). - NBD fixes (Josef). - Disable writeback throttling by default on bfq, like we do on cfq (Luca Miccio). - Series from Ming that enable us to treat flush requests on blk-mq like any other request. This is a really nice cleanup. - Series from Ming that improves merging on blk-mq with schedulers, getting us closer to flipping the switch on scsi-mq again. - BFQ updates (Paolo). - blk-mq atomic flags memory ordering fixes (Peter Z). - Loop cgroup support (Shaohua). - Lots of minor fixes from lots of different folks, both for core and driver code" * 'for-4.15/block' of git://git.kernel.dk/linux-block: (294 commits) nvme: fix visibility of "uuid" ns attribute blk-mq: fixup some comment typos and lengths ide: ide-atapi: fix compile error with defining macro DEBUG blk-mq: improve tag waiting setup for non-shared tags brd: remove unused brd_mutex blk-mq: only run the hardware queue if IO is pending block: avoid null pointer dereference on null disk fs: guard_bio_eod() needs to consider partitions xtensa/simdisk: fix compile error nvme: expose subsys attribute to sysfs nvme: create 'slaves' and 'holders' entries for hidden controllers block: create 'slaves' and 'holders' entries for hidden gendisks nvme: also expose the namespace identification sysfs files for mpath nodes nvme: implement multipath access to nvme subsystems nvme: track shared namespaces nvme: introduce a nvme_ns_ids structure nvme: track subsystems block, nvme: Introduce blk_mq_req_flags_t block, scsi: Make SCSI quiesce and resume work reliably block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag ...
| * nvme: remove handling of multiple AEN requestsKeith Busch2017-11-111-4/+1
| | | | | | | | | | | | | | | | | | | | | | The driver can handle tracking only one AEN request, so this patch removes handling for multiple ones. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * nvme: centralize AEN definesKeith Busch2017-11-111-11/+3
| | | | | | | | | | | | | | | | | | | | All the transports were unnecessarilly duplicating the AEN request accounting. This patch defines everything in one place. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Guan Junxiong <guanjunxiong@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * nvme-rdma: fix nvme_rdma_create_queue_ib error flowMax Gurtovoy2017-11-111-1/+1
| | | | | | | | | | | | | | | | | | QP object is created using rdma_cm api, therefore the destruction should use the same api for symmetry. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * nvme-rdma: reuse nvme_delete_ctrl when reconnect attempts expireSagi Grimberg2017-11-011-1/+1
| | | | | | | | | | | | | | | | instead of just queueing delete work Reported-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
| * nvme: consolidate common code from ->reset_workChristoph Hellwig2017-11-011-4/+0
| | | | | | | | | | | | | | | | | | | | No change in behavior except that the FC code cancels two work items a little later now. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
| * nvme-rdma: remove nvme_rdma_remove_ctrlChristoph Hellwig2017-11-011-10/+8
| | | | | | | | | | | | | | | | | | It is only used in two places, and some of the work done by it will be taken into common code soon. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
| * nvme: move controller deletion to common codeChristoph Hellwig2017-11-011-39/+6
| | | | | | | | | | | | | | | | | | | | | | Move the ->delete_work and the associated helpers to common code instead of duplicating them in every driver. This also adds the missing reference get/put for the loop driver. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
| * nvme-rdma: add support for duplicate_connect optionJames Smart2017-10-271-0/+82
| | | | | | | | | | | | | | | | | | | | | | | | Adds support for the duplicate_connect option. When set to true, checks whether there's an existing controller via the same target address (traddr), target port (trsvcid), and if specified, host address (host_traddr). Fails the connection request if there is an existing controller. Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
| * nvme: switch controller refcounting to use struct deviceChristoph Hellwig2017-10-271-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of allocating a separate struct device for the character device handle embedd it into struct nvme_ctrl and use it for the main controller refcounting. This removes double refcounting and gets us an automatic reference for the character device operations. We keep ctrl->device as a pointer for now to avoid chaning printks all over, but in the future we could look into message printing helpers that take a controller structure similar to what other subsystems do. Note the delete_ctrl operation always already has a reference (either through sysfs due this change, or because every open file on the /dev/nvme-fabrics node has a refernece) when it is entered now, so we don't need to do the unless_zero variant there. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com>
| * nvme-rdma: Add debug message when reaches timeoutNitzan Carmi2017-10-231-0/+4
| | | | | | | | | | | | Signed-off-by: Nitzan Carmi <nitzanc@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
| * nvme-rdma: align nvme_rdma_device structureMax Gurtovoy2017-10-231-2/+2
| | | | | | | | | | Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
| * nvme-rdma: Add BLK_MQ_F_NO_SCHED flag to admin tag setIsrael Rukshin2017-10-191-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | Since commit b86dd81 "block: get rid of blk-mq default scheduler choice Kconfig entries", when setting nr_hw_queues to 1 the admin tag set uses mq-deadline scheduler. This flag is useful for admin queues that aren't used for normal IO. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
| * nvme-rdma: stop controller reset if the controller is deletingSagi Grimberg2017-10-181-1/+5
| | | | | | | | | | | | | | | | | | If the controller is deleting (in case the user decided to delete it), we have no point to continue reset sequence. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
| * nvme-rdma: change queue flag semantics DELETING -> ALLOCATEDSagi Grimberg2017-10-181-4/+4
| | | | | | | | | | | | | | | | Instead of marking we are deleting, mark we are allocated and check that instead. This makes the logic symmetrical to connected mark check. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
| * nvme-rdma: Don't local invalidate if the queue is not liveSagi Grimberg2017-10-181-1/+1
| | | | | | | | | | | | | | | | | | No chance for the local invalidate to succeed if the queue-pair is in error state. Most likely the target will do a remote invalidation of our mr so not a big loss on the test_bit. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>