| Commit message (Collapse) | Author | Age | Files | Lines |
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Pull block updates from Jens Axboe:
"This is the main pull request for block IO related changes for the
4.16 kernel. Nothing major in this pull request, but a good amount of
improvements and fixes all over the map. This contains:
- BFQ improvements, fixes, and cleanups from Angelo, Chiara, and
Paolo.
- Support for SMR zones for deadline and mq-deadline from Damien and
Christoph.
- Set of fixes for bcache by way of Michael Lyle, including fixes
from himself, Kent, Rui, Tang, and Coly.
- Series from Matias for lightnvm with fixes from Hans Holmberg,
Javier, and Matias. Mostly centered around pblk, and the removing
rrpc 1.2 in preparation for supporting 2.0.
- A couple of NVMe pull requests from Christoph. Nothing major in
here, just fixes and cleanups, and support for command tracing from
Johannes.
- Support for blk-throttle for tracking reads and writes separately.
From Joseph Qi. A few cleanups/fixes also for blk-throttle from
Weiping.
- Series from Mike Snitzer that enables dm to register its queue more
logically, something that's alwways been problematic on dm since
it's a stacked device.
- Series from Ming cleaning up some of the bio accessor use, in
preparation for supporting multipage bvecs.
- Various fixes from Ming closing up holes around queue mapping and
quiescing.
- BSD partition fix from Richard Narron, fixing a problem where we
can't mount newer (10/11) FreeBSD partitions.
- Series from Tejun reworking blk-mq timeout handling. The previous
scheme relied on atomic bits, but it had races where we would think
a request had timed out if it to reused at the wrong time.
- null_blk now supports faking timeouts, to enable us to better
exercise and test that functionality separately. From me.
- Kill the separate atomic poll bit in the request struct. After
this, we don't use the atomic bits on blk-mq anymore at all. From
me.
- sgl_alloc/free helpers from Bart.
- Heavily contended tag case scalability improvement from me.
- Various little fixes and cleanups from Arnd, Bart, Corentin,
Douglas, Eryu, Goldwyn, and myself"
* 'for-4.16/block' of git://git.kernel.dk/linux-block: (186 commits)
block: remove smart1,2.h
nvme: add tracepoint for nvme_complete_rq
nvme: add tracepoint for nvme_setup_cmd
nvme-pci: introduce RECONNECTING state to mark initializing procedure
nvme-rdma: remove redundant boolean for inline_data
nvme: don't free uuid pointer before printing it
nvme-pci: Suspend queues after deleting them
bsg: use pr_debug instead of hand crafted macros
blk-mq-debugfs: don't allow write on attributes with seq_operations set
nvme-pci: Fix queue double allocations
block: Set BIO_TRACE_COMPLETION on new bio during split
blk-throttle: use queue_is_rq_based
block: Remove kblockd_schedule_delayed_work{,_on}()
blk-mq: Avoid that blk_mq_delay_run_hw_queue() introduces unintended delays
blk-mq: Rename blk_mq_request_direct_issue() into blk_mq_request_issue_directly()
lib/scatterlist: Fix chaining support in sgl_alloc_order()
blk-throttle: track read and write request individually
block: add bdev_read_only() checks to common helpers
block: fail op_is_write() requests to read-only partitions
blk-throttle: export io_serviced_recursive, io_service_bytes_recursive
...
|
| |
| |
| |
| |
| |
| |
| |
| | |
smart1,2.h is unused since commit d436641439e0 ("cpqarray: remove it from the kernel")
Remove it from tree.
Signed-off-by: Corentin Labbe <clabbe@baylibre.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |\
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Pull NVMe fixes from Christoph:
"The additional week before the 4.15 release gave us time for a few more
nvme fixes, as well as the nifty trace points from Johannes"
* 'nvme-4.16' of git://git.infradead.org/nvme:
nvme: add tracepoint for nvme_complete_rq
nvme: add tracepoint for nvme_setup_cmd
nvme-pci: introduce RECONNECTING state to mark initializing procedure
nvme-rdma: remove redundant boolean for inline_data
nvme: don't free uuid pointer before printing it
nvme-pci: Suspend queues after deleting them
nvme-pci: Fix queue double allocations
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Add a tracepoint in nvme_complete_rq() for completions of NVMe commands. An
expmale output of the trace-point is as follows:
<idle>-0 [001] d.h. 3.505266: nvme_complete_rq: cmdid=989, qid=1, res=0, retries=0, flags=0x0, status=0
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Add tracepoints for nvme_setup_cmd() for tracing admin and/or nvm commands.
Examples of the two tracepoints are as follows for trace_nvme_setup_admin_cmd():
kworker/u8:0-5 [003] .... 2.998792: nvme_setup_admin_cmd: cmdid=14, flags=0x0, meta=0x0, cmd=(nvme_admin_create_cq cqid=1, qsize=1023, cq_flags=0x3, irq_vector=0)
and trace_nvme_setup_nvm_cmd():
dd-205 [001] .... 3.503929: nvme_setup_nvm_cmd: qid=1, nsid=1, cmdid=989, flags=0x0, meta=0x0, cmd=(nvme_cmd_read slba=4096, len=2047, ctrl=0x0, dsmgmt=0, reftag=0)
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
After Sagi's commit (nvme-rdma: fix concurrent reset and reconnect),
both nvme-fc/rdma have following pattern:
RESETTING - quiesce blk-mq queues, teardown and delete queues/
connections, clear out outstanding IO requests...
RECONNECTING - establish new queues/connections and some other
initializing things.
Introduce RECONNECTING to nvme-pci transport to do the same mark.
Then we get a coherent state definition among nvme pci/rdma/fc
transports.
Suggested-by: James Smart <james.smart@broadcom.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Commit df351ef73789 ("nvme-fabrics: fix memory leak when parsing host ID
option") fixed the leak of 'p' but in case uuid_parse() fails the memory
is freed before the error print that is using it.
Free it after printing eventual errors.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Fixes: df351ef73789 ("nvme-fabrics: fix memory leak when parsing host ID option")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The driver had been abusing the cq_vector state to know if new submissions
were safe, but that was before we could quiesce blk-mq. If the controller
happens to get an interrupt through while we're suspending those queues,
'no irq handler' warnings may occur.
This patch will disable the interrupts only after the queues are deleted.
Reported-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Tested-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The queue count says the highest queue that's been allocated, so don't
reallocate a queue lower than that.
Fixes: 147b27e4bd0 ("nvme-pci: allocate device queues storage space at probe")
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Use pr_debug instead of hand crafted macros. This way it is not needed to
re-compile the kernel to enable bsg debug outputs and it's possible to
selectively enable specific prints.
Cc: Joe Perches <joe@perches.com>
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Attributes that only implement .seq_ops are read-only, any write to
them should be rejected. But currently kernel would crash when
writing to such debugfs entries, e.g.
chmod +w /sys/kernel/debug/block/<dev>/requeue_list
echo 0 > /sys/kernel/debug/block/<dev>/requeue_list
chmod -w /sys/kernel/debug/block/<dev>/requeue_list
Fix it by returning -EPERM in blk_mq_debugfs_write() when writing to
such attributes.
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |/
| |
| |
| |
| |
| |
| |
| |
| | |
We inadvertently set it again on the source bio, but we need
to set it on the new split bio instead.
Fixes: fbbaf700e7b1 ("block: trace completion of all bios.")
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| |
| |
| | |
use queue_is_rq_based instead of open code.
Signed-off-by: weiping zhang <zhangweiping@didichuxing.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The previous patch removed all users of these two functions. Hence
also remove the functions themselves.
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Make sure that calling blk_mq_run_hw_queue() or
blk_mq_kick_requeue_list() triggers a queue run without delay even
if blk_mq_delay_run_hw_queue() has been called recently and if its
delay has not yet expired.
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
blk_mq_request_issue_directly()
Most blk-mq functions have a name that follows the pattern blk_mq_${action}.
However, the function name blk_mq_request_direct_issue is an exception.
Hence rename this function. This patch does not change any functionality.
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This patch avoids that workloads with large block sizes (megabytes)
can trigger the following call stack with the ib_srpt driver (that
driver is the only driver that chains scatterlists allocated by
sgl_alloc_order()):
BUG: Bad page state in process kworker/0:1H pfn:2423a78
page:fffffb03d08e9e00 count:-3 mapcount:0 mapping: (null) index:0x0
flags: 0x57ffffc0000000()
raw: 0057ffffc0000000 0000000000000000 0000000000000000 fffffffdffffffff
raw: dead000000000100 dead000000000200 0000000000000000 0000000000000000
page dumped because: nonzero _count
CPU: 0 PID: 733 Comm: kworker/0:1H Tainted: G I 4.15.0-rc7.bart+ #1
Hardware name: HP ProLiant DL380 G7, BIOS P67 08/16/2015
Workqueue: ib-comp-wq ib_cq_poll_work [ib_core]
Call Trace:
dump_stack+0x5c/0x83
bad_page+0xf5/0x10f
get_page_from_freelist+0xa46/0x11b0
__alloc_pages_nodemask+0x103/0x290
sgl_alloc_order+0x101/0x180
target_alloc_sgl+0x2c/0x40 [target_core_mod]
srpt_alloc_rw_ctxs+0x173/0x2d0 [ib_srpt]
srpt_handle_new_iu+0x61e/0x7f0 [ib_srpt]
__ib_process_cq+0x55/0xa0 [ib_core]
ib_cq_poll_work+0x1b/0x60 [ib_core]
process_one_work+0x141/0x340
worker_thread+0x47/0x3e0
kthread+0xf5/0x130
ret_from_fork+0x1f/0x30
Fixes: e80a0af4759a ("lib/scatterlist: Introduce sgl_alloc() and sgl_free()")
Reported-by: Laurence Oberman <loberman@redhat.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Cc: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |\
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Pull NVMe fixes for 4.16 from Christoph.
* 'nvme-4.16' of git://git.infradead.org/nvme:
nvme-pci: clean up SMBSZ bit definitions
nvme-pci: clean up CMB initialization
nvme-fc: correct hang in nvme_ns_remove()
nvme-fc: fix rogue admin cmds stalling teardown
nvmet: release a ns reference in nvmet_req_uninit if needed
nvme-fabrics: fix memory leak when parsing host ID option
nvme: fix comment typos in nvme_create_io_queues
nvme: host delete_work and reset_work on separate workqueues
nvme-pci: allocate device queues storage space at probe
nvme-pci: serialize pci resets
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Define the bit positions instead of macros using the magic values,
and move the expanded helpers to calculate the size and size unit into
the implementation C file.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Refactor the call to nvme_map_cmb, and change the conditions for probing
for the CMB. First remove the version check as NVMe TPs always apply
to earlier versions of the spec as well. Second check for the whole CMBSZ
register for support of the CMB feature instead of just the size field
inside of it to simplify the code a bit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
When connectivity is lost to a device, the association is terminated
and the blk-mq queues are quiesced/stopped. When connectivity is
re-established, they are resumed.
If connectivity is lost for a sufficient amount of time that the
controller is then deleted, the delete path starts tearing down queues,
and eventually calling nvme_ns_remove(). It appears that pending
commands may cause blk_cleanup_queue() to never complete and the
teardown stalls.
Correct by starting the ns queues after transitioning to a DELETING
state, allowing pending commands to be flushed with io failures. Thus
the delete path is clear when reached.
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
When connectivity is lost to a device, the association is terminated
and the blk-mq queues are quiesced/stopped. When connectivity is
re-established, they are resumed.
If an admin command is received while connectivity is list, the ioctl
queues the command on the admin_q and the command stalls (the thread
issuing the ioctl hangs/waits). if the connectivity is lost long
enough such that the controller is then deleted, the delete code
makes its calls to initiate the delete, which then expects the core
layer to call the transport when all references are removed and the
controller can be freed. Unfortunately, nothing in this path dequeued
the admin command, so a reference sits outstanding and things stop,
hanging the delete indefinitely.
Correct by unquiescing the admin queue in the delete association. This
means any admin command (which should only be from an ioctl) issued
after connectivity is lost will detect the controller is in a
reconnecting state and will (fast) fail the command. Thus, a pending
reference can no longer be created. Once connectivity is re-established,
a new ioctl/admin command would see proper device state and function again.
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
nvmet_req_init looked up a namespace and took a reference on it (unless it
failed prior to that). If the request is uninitialized (in error cases) we
need to remove that reference in case it was taken, otherwise we leak
namespace reference when calling nvme_req_uninit.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
We use match_strdup() to get a copy of the option string for host ID string, but
we just pass it to uuid_parse() and don't store the string pointer, so we need to
kfree() the string after parsing it.
Signed-off-by: Roland Dreier <roland@purestorage.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
fix comment typos in nvme_create_io_queues() like below.
_aount_ to _amount_
_an_ to _can_
Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
We need to ensure that delete_work will be hosted on a different
workqueue than all the works we flush or cancel from it.
Otherwise we may hit a circular dependency warning [1].
Also, given that delete_work flushes reset_work, host reset_work
on nvme_reset_wq and delete_work on nvme_delete_wq. In addition,
fix the flushing in the individual drivers to flush nvme_delete_wq
when draining queued deletes.
[1]:
[ 178.491942] =============================================
[ 178.492718] [ INFO: possible recursive locking detected ]
[ 178.493495] 4.9.0-rc4-c844263313a8-lb #3 Tainted: G OE
[ 178.494382] ---------------------------------------------
[ 178.495160] kworker/5:1/135 is trying to acquire lock:
[ 178.495894] (
[ 178.496120] "nvme-wq"
[ 178.496471] ){++++.+}
[ 178.496599] , at:
[ 178.496921] [<ffffffffa70ac206>] flush_work+0x1a6/0x2d0
[ 178.497670]
but task is already holding lock:
[ 178.498499] (
[ 178.498724] "nvme-wq"
[ 178.499074] ){++++.+}
[ 178.499202] , at:
[ 178.499520] [<ffffffffa70ad6c2>] process_one_work+0x162/0x6a0
[ 178.500343]
other info that might help us debug this:
[ 178.501269] Possible unsafe locking scenario:
[ 178.502113] CPU0
[ 178.502472] ----
[ 178.502829] lock(
[ 178.503115] "nvme-wq"
[ 178.503467] );
[ 178.503716] lock(
[ 178.504001] "nvme-wq"
[ 178.504353] );
[ 178.504601]
*** DEADLOCK ***
[ 178.505441] May be due to missing lock nesting notation
[ 178.506453] 2 locks held by kworker/5:1/135:
[ 178.507068] #0:
[ 178.507330] (
[ 178.507598] "nvme-wq"
[ 178.507726] ){++++.+}
[ 178.508079] , at:
[ 178.508173] [<ffffffffa70ad6c2>] process_one_work+0x162/0x6a0
[ 178.509004] #1:
[ 178.509265] (
[ 178.509532] (&ctrl->delete_work)
[ 178.509795] ){+.+.+.}
[ 178.510145] , at:
[ 178.510239] [<ffffffffa70ad6c2>] process_one_work+0x162/0x6a0
[ 178.511070]
stack backtrace:
:
[ 178.511693] CPU: 5 PID: 135 Comm: kworker/5:1 Tainted: G OE 4.9.0-rc4-c844263313a8-lb #3
[ 178.512974] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014
[ 178.514247] Workqueue: nvme-wq nvme_del_ctrl_work [nvme_tcp]
[ 178.515071] ffffc2668175bae0 ffffffffa7450823 ffffffffa88abd80 ffffffffa88abd80
[ 178.516195] ffffc2668175bb98 ffffffffa70eb012 ffffffffa8d8d90d ffff9c472e9ea700
[ 178.517318] ffff9c472e9ea700 ffff9c4700000000 ffff9c4700007200 ab83be61bec0d50e
[ 178.518443] Call Trace:
[ 178.518807] [<ffffffffa7450823>] dump_stack+0x85/0xc2
[ 178.519542] [<ffffffffa70eb012>] __lock_acquire+0x17d2/0x18f0
[ 178.520377] [<ffffffffa75839a7>] ? serial8250_console_putchar+0x27/0x30
[ 178.521330] [<ffffffffa7583980>] ? wait_for_xmitr+0xa0/0xa0
[ 178.522174] [<ffffffffa70ac1eb>] ? flush_work+0x18b/0x2d0
[ 178.522975] [<ffffffffa70eb7cb>] lock_acquire+0x11b/0x220
[ 178.523753] [<ffffffffa70ac206>] ? flush_work+0x1a6/0x2d0
[ 178.524535] [<ffffffffa70ac229>] flush_work+0x1c9/0x2d0
[ 178.525291] [<ffffffffa70ac206>] ? flush_work+0x1a6/0x2d0
[ 178.526077] [<ffffffffa70a9cf0>] ? flush_workqueue_prep_pwqs+0x220/0x220
[ 178.527040] [<ffffffffa70ae7cf>] __cancel_work_timer+0x10f/0x1d0
[ 178.527907] [<ffffffffa70fecb9>] ? vprintk_default+0x29/0x40
[ 178.528726] [<ffffffffa71cb507>] ? printk+0x48/0x50
[ 178.529434] [<ffffffffa70ae8c3>] cancel_delayed_work_sync+0x13/0x20
[ 178.530381] [<ffffffffc042100b>] nvme_stop_ctrl+0x5b/0x70 [nvme_core]
[ 178.531314] [<ffffffffc0403dcc>] nvme_del_ctrl_work+0x2c/0x50 [nvme_tcp]
[ 178.532271] [<ffffffffa70ad741>] process_one_work+0x1e1/0x6a0
[ 178.533101] [<ffffffffa70ad6c2>] ? process_one_work+0x162/0x6a0
[ 178.533954] [<ffffffffa70adc4e>] worker_thread+0x4e/0x490
[ 178.534735] [<ffffffffa70adc00>] ? process_one_work+0x6a0/0x6a0
[ 178.535588] [<ffffffffa70adc00>] ? process_one_work+0x6a0/0x6a0
[ 178.536441] [<ffffffffa70b48cf>] kthread+0xff/0x120
[ 178.537149] [<ffffffffa70b47d0>] ? kthread_park+0x60/0x60
[ 178.538094] [<ffffffffa70b47d0>] ? kthread_park+0x60/0x60
[ 178.538900] [<ffffffffa78e332a>] ret_from_fork+0x2a/0x40
Signed-off-by: Roy Shterman <roys@lightbitslabs.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
It may cause race by setting 'nvmeq' in nvme_init_request()
because .init_request is called inside switching io scheduler, which
may happen when the NVMe device is being resetted and its nvme queues
are being freed and created. We don't have any sync between the two
pathes.
This patch changes the nvmeq allocation to occur at probe time so
there is no way we can dereference it at init_request.
[ 93.268391] kernel BUG at drivers/nvme/host/pci.c:408!
[ 93.274146] invalid opcode: 0000 [#1] SMP
[ 93.278618] Modules linked in: nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss
nfsv4 dns_resolver nfs lockd grace fscache sunrpc ipmi_ssif vfat fat
intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel
kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel iTCO_wdt
intel_cstate ipmi_si iTCO_vendor_support intel_uncore mxm_wmi mei_me
ipmi_devintf intel_rapl_perf pcspkr sg ipmi_msghandler lpc_ich dcdbas mei
shpchp acpi_power_meter wmi dm_multipath ip_tables xfs libcrc32c sd_mod
mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt
fb_sys_fops ttm drm ahci libahci nvme libata crc32c_intel nvme_core tg3
megaraid_sas ptp i2c_core pps_core dm_mirror dm_region_hash dm_log dm_mod
[ 93.349071] CPU: 5 PID: 1842 Comm: sh Not tainted 4.15.0-rc2.ming+ #4
[ 93.356256] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.5.5 08/16/2017
[ 93.364801] task: 00000000fb8abf2a task.stack: 0000000028bd82d1
[ 93.371408] RIP: 0010:nvme_init_request+0x36/0x40 [nvme]
[ 93.377333] RSP: 0018:ffffc90002537ca8 EFLAGS: 00010246
[ 93.383161] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000008
[ 93.391122] RDX: 0000000000000000 RSI: ffff880276ae0000 RDI: ffff88047bae9008
[ 93.399084] RBP: ffff88047bae9008 R08: ffff88047bae9008 R09: 0000000009dabc00
[ 93.407045] R10: 0000000000000004 R11: 000000000000299c R12: ffff880186bc1f00
[ 93.415007] R13: ffff880276ae0000 R14: 0000000000000000 R15: 0000000000000071
[ 93.422969] FS: 00007f33cf288740(0000) GS:ffff88047ba80000(0000) knlGS:0000000000000000
[ 93.431996] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 93.438407] CR2: 00007f33cf28e000 CR3: 000000047e5bb006 CR4: 00000000001606e0
[ 93.446368] Call Trace:
[ 93.449103] blk_mq_alloc_rqs+0x231/0x2a0
[ 93.453579] blk_mq_sched_alloc_tags.isra.8+0x42/0x80
[ 93.459214] blk_mq_init_sched+0x7e/0x140
[ 93.463687] elevator_switch+0x5a/0x1f0
[ 93.467966] ? elevator_get.isra.17+0x52/0xc0
[ 93.472826] elv_iosched_store+0xde/0x150
[ 93.477299] queue_attr_store+0x4e/0x90
[ 93.481580] kernfs_fop_write+0xfa/0x180
[ 93.485958] __vfs_write+0x33/0x170
[ 93.489851] ? __inode_security_revalidate+0x4c/0x60
[ 93.495390] ? selinux_file_permission+0xda/0x130
[ 93.500641] ? _cond_resched+0x15/0x30
[ 93.504815] vfs_write+0xad/0x1a0
[ 93.508512] SyS_write+0x52/0xc0
[ 93.512113] do_syscall_64+0x61/0x1a0
[ 93.516199] entry_SYSCALL64_slow_path+0x25/0x25
[ 93.521351] RIP: 0033:0x7f33ce96aab0
[ 93.525337] RSP: 002b:00007ffe57570238 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 93.533785] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007f33ce96aab0
[ 93.541746] RDX: 0000000000000006 RSI: 00007f33cf28e000 RDI: 0000000000000001
[ 93.549707] RBP: 00007f33cf28e000 R08: 000000000000000a R09: 00007f33cf288740
[ 93.557669] R10: 00007f33cf288740 R11: 0000000000000246 R12: 00007f33cec42400
[ 93.565630] R13: 0000000000000006 R14: 0000000000000001 R15: 0000000000000000
[ 93.573592] Code: 4c 8d 40 08 4c 39 c7 74 16 48 8b 00 48 8b 04 08 48 85 c0
74 16 48 89 86 78 01 00 00 31 c0 c3 8d 4a 01 48 63 c9 48 c1 e1 03 eb de <0f>
0b 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 85 f6 53 48 89
[ 93.594676] RIP: nvme_init_request+0x36/0x40 [nvme] RSP: ffffc90002537ca8
[ 93.602273] ---[ end trace 810dde3993e5f14e ]---
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
| | |
| | |
| | |
| | |
| | | |
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
In mixed read/write workload on SSD, write latency is much lower than
read. But now we only track and record read latency and then use it as
threshold base for both read and write io latency accounting. As a
result, write io latency will always be considered as good and
bad_bio_cnt is much smaller than 20% of bio_cnt. That is to mean, the
tg to be checked will be treated as idle most of the time and still let
others dispatch more ios, even it is truly running under low limit and
wants its low limit to be guaranteed, which is not we expected in fact.
So track read and write request individually, which can bring more
precise latency control for low limit idle detection.
Signed-off-by: Joseph Qi <qijiang.qj@alibaba-inc.com>
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Similar to blkdev_write_iter(), return -EPERM if the partition is
read-only. This covers ioctl(), fallocate() and most in-kernel users
but isn't meant to be exhaustive -- everything else will be caught in
generic_make_request_checks(), fail with -EIO and can be fixed later.
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Regular block device writes go through blkdev_write_iter(), which does
bdev_read_only(), while zeroout/discard/etc requests are never checked,
both userspace- and kernel-triggered. Add a generic catch-all check to
generic_make_request_checks() to actually enforce ioctl(BLKROSET) and
set_disk_ro(), which is used by quite a few drivers for things like
snapshots, read-only backing files/images, etc.
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
export these two interface for cgroup-v1.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: weiping zhang <zhangweiping@didichuxing.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The __blk_mq_register_dev(), blk_mq_unregister_dev(),
elv_register_queue() and elv_unregister_queue() calls need to be
protected with sysfs_lock but other code in these functions not.
Hence protect only this code with sysfs_lock. This patch fixes a
locking inversion issue in blk_unregister_queue() and also in an
error path of blk_register_queue(): it is not allowed to hold
sysfs_lock around the kobject_del(&q->kobj) call.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This patch does not change any functionality.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
These two functions are only called from inside the block layer so
unexport them.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
To maximise responsiveness, BFQ raises the weight, and performs device
idling, for bfq_queues associated with processes deemed as
interactive. In particular, weight raising has a maximum duration,
equal to the time needed to start a large application. If a
weight-raised process goes on doing I/O beyond this maximum duration,
it loses weight-raising.
This mechanism is evidently vulnerable to the following false
positives: I/O-bound applications that will go on doing I/O for much
longer than the duration of weight-raising. These applications have
basically no benefit from being weight-raised at the beginning of
their I/O. On the opposite end, while being weight-raised, these
applications
a) unjustly steal throughput to applications that may truly need
low latency;
b) make BFQ uselessly perform device idling; device idling results
in loss of device throughput with most flash-based storage, and may
increase latencies when used purposelessly.
This commit adds a countermeasure to reduce both the above
problems. To introduce this countermeasure, we provide the following
extra piece of information (full details in the comments added by this
commit). During the start-up of the large application used as a
reference to set the duration of weight-raising, involved processes
transfer at most ~110K sectors each. Accordingly, a process initially
deemed as interactive has no right to be weight-raised any longer,
once transferred 110K sectors or more.
Basing on this consideration, this commit early-ends weight-raising
for a bfq_queue if the latter happens to have received an amount of
service at least equal to 110K sectors (actually, a little bit more,
to keep a safety margin). I/O-bound applications that reach a high
throughput, such as file copy, get to this threshold much before the
allowed weight-raising period finishes. Thus this early ending of
weight-raising reduces the amount of time during which these
applications cause the problems described above.
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Asynchronous I/O can easily starve synchronous I/O (both sync reads
and sync writes), by consuming all request tags. Similarly, storms of
synchronous writes, such as those that sync(2) may trigger, can starve
synchronous reads. In their turn, these two problems may also cause
BFQ to loose control on latency for interactive and soft real-time
applications. For example, on a PLEXTOR PX-256M5S SSD, LibreOffice
Writer takes 0.6 seconds to start if the device is idle, but it takes
more than 45 seconds (!) if there are sequential writes in the
background.
This commit addresses this issue by limiting the maximum percentage of
tags that asynchronous I/O requests and synchronous write requests can
consume. In particular, this commit grants a higher threshold to
synchronous writes, to prevent the latter from being starved by
asynchronous I/O.
According to the above test, LibreOffice Writer now starts in about
1.2 seconds on average, regardless of the background workload, and
apart from some rare outlier. To check this improvement, run, e.g.,
sudo ./comm_startup_lat.sh bfq 5 5 seq 10 "lowriter --terminate_after_init"
for the comm_startup_lat benchmark in the S suite [1].
[1] https://github.com/Algodev-github/S
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
If we run into blk_mq_request_direct_issue(), when queue is busy, we
don't want to dispatch this request into hctx->dispatch_list, and
what we need to do is to return the queue busy info to caller, so
that caller can deal with it well.
Fixes: 396eaf21ee ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback")
Reported-by: Laurence Oberman <loberman@redhat.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Fixes: 4246a0b63bd8 ("block: add a bi_error field to struct bio")
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
After commit:
923218f6166a ("blk-mq: don't allocate driver tag upfront for flush rq")
we no longer use the 'can_block' argument in
blk_mq_sched_insert_request(). Kill it.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Added actual commit message as to why it's being removed.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
blk_insert_cloned_request() is called in the fast path of a dm-rq driver
(e.g. blk-mq request-based DM mpath). blk_insert_cloned_request() uses
blk_mq_request_bypass_insert() to directly append the request to the
blk-mq hctx->dispatch_list of the underlying queue.
1) This way isn't efficient enough because the hctx spinlock is always
used.
2) With blk_insert_cloned_request(), we completely bypass underlying
queue's elevator and depend on the upper-level dm-rq driver's elevator
to schedule IO. But dm-rq currently can't get the underlying queue's
dispatch feedback at all. Without knowing whether a request was issued
or not (e.g. due to underlying queue being busy) the dm-rq elevator will
not be able to provide effective IO merging (as a side-effect of dm-rq
currently blindly destaging a request from its elevator only to requeue
it after a delay, which kills any opportunity for merging). This
obviously causes very bad sequential IO performance.
Fix this by updating blk_insert_cloned_request() to use
blk_mq_request_direct_issue(). blk_mq_request_direct_issue() allows a
request to be issued directly to the underlying queue and returns the
dispatch feedback (blk_status_t). If blk_mq_request_direct_issue()
returns BLK_SYS_RESOURCE the dm-rq driver will now use DM_MAPIO_REQUEUE
to _not_ destage the request. Whereby preserving the opportunity to
merge IO.
With this, request-based DM's blk-mq sequential IO performance is vastly
improved (as much as 3X in mpath/virtio-scsi testing).
Signed-off-by: Ming Lei <ming.lei@redhat.com>
[blk-mq.c changes heavily influenced by Ming Lei's initial solution, but
they were refactored to make them less fragile and easier to read/review]
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
No functional change. Just makes code flow more logically.
In following commit, __blk_mq_try_issue_directly() will be used to
return the dispatch result (blk_status_t) to DM. DM needs this
information to improve IO merging.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
We know this WARN_ON is harmless and in reality it may be trigged,
so convert it to printk() and dump_stack() to avoid to confusing
people.
Also add comment about two releated races here.
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
When hctx->next_cpu is set from possible online CPUs, there is one
race in which hctx->next_cpu may be set as >= nr_cpu_ids, and finally
break workqueue.
The race can be triggered in the following two sitations:
1) when one CPU is becoming DEAD, blk_mq_hctx_notify_dead() is called
to dispatch requests from the DEAD cpu context, but at that
time, this DEAD CPU has been cleared from 'cpu_online_mask', so all
CPUs in hctx->cpumask may become offline, and cause hctx->next_cpu set
a bad value.
2) blk_mq_delay_run_hw_queue() is called from CPU B, and found the queue
should be run on the other CPU A, then CPU A may become offline at the
same time and all CPUs in hctx->cpumask become offline.
This patch deals with this issue by re-selecting next CPU, and making
sure it is set correctly.
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-by: "jianchao.wang" <jianchao.w.wang@oracle.com>
Tested-by: "jianchao.wang" <jianchao.w.wang@oracle.com>
Fixes: 20e4d81393 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
'struct frame' uses two variables to store the sent timestamp - 'struct
timeval' and jiffies. jiffies is used to avoid discrepancies caused by
updates to system time. 'struct timeval' is deprecated because it uses
32-bit representation for seconds which will overflow in year 2038.
This patch does the following:
- Replace the use of 'struct timeval' and jiffies with ktime_t, which
is the recommended type for timestamping
- ktime_t provides both long range (like jiffies) and high resolution
(like timeval). Using ktime_get (monotonic time) instead of wall-clock
time prevents any discprepancies caused by updates to system time.
[updates by Arnd below]
The original patch from Tina never went anywhere as we discussed how
to keep the impact on performance minimal. I've started over now but
arrived at basically the same patch that she had originally, except for
an slightly improved tsince_hr() function. I'm making it more robust
against overflows, and also optimize explicitly for the common case
in which a frame is less than 4.2 seconds old, using only a 32-bit
division in that case.
This should make the new version more efficient than the old code,
since we replace the existing two 32-bit division in do_gettimeofday()
plus one multiplication with a single single 32-bit division in
tsince_hr() and drop the double bookkeeping. It's also more efficient
than the ktime_get_us() API we discussed before, since that would
also rely on multiple divisions.
Link: https://lists.linaro.org/pipermail/y2038/2015-May/000276.html
Signed-off-by: Tina Ruchandani <ruchandani.tina@gmail.com>
Cc: Ed Cashin <ed.cashin@acm.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Some older compilers (gcc-4.4 through 4.6 in particular) struggle
with the way that blkg_rwstat_read() returns a structure, leading
to excessive stack usage and rather inefficient code:
block/blk-cgroup.c: In function 'blkg_destroy':
block/blk-cgroup.c:354:1: error: the frame size of 1296 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]
block/cfq-iosched.c: In function 'cfqg_stats_add_aux':
block/cfq-iosched.c:753:1: error: the frame size of 1928 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]
block/bfq-cgroup.c: In function 'bfqg_stats_add_aux':
block/bfq-cgroup.c:299:1: error: the frame size of 1928 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]
I also notice that there is no point in using atomic accesses
for the local variables, so storing the temporaries in simple 'u64'
variables not only avoids the stack usage on older compilers but
also improves the object code on modern versions.
Fixes: e6269c445467 ("blkcg: add blkg_[rw]stat->aux_cnt and replace cfq_group->dead_stats with it")
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
DM is no longer prone to having its request_queue be improperly
initialized.
Summary of changes:
- defer DM's blk_register_queue() from add_disk()-time until
dm_setup_md_queue() by using add_disk_no_queue_reg() in alloc_dev().
- dm_setup_md_queue() is updated to fully initialize DM's request_queue
(_after_ all table loads have occurred and the request_queue's type,
features and limits are known).
A very welcome side-effect of these changes is DM no longer needs to:
1) backfill the "mq" sysfs entry (because historically DM didn't
initialize the request_queue to use blk-mq until _after_
blk_register_queue() was called via add_disk()).
2) call elv_register_queue() to get .request_fn request-based DM
device's "iosched" exposed in syfs.
In addition, blk-mq debugfs support is now made available because
request-based DM's blk-mq request_queue is now properly initialized
before dm_setup_md_queue() calls blk_register_queue().
These changes also stave off the need to introduce new DM-specific
workarounds in block core, e.g. this proposal:
https://patchwork.kernel.org/patch/10067961/
In the end DM devices should be less unicorn in nature (relative to
initialization and availability of block core infrastructure provided by
the request_queue).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
During stress tests by syzkaller on the sg driver the block layer
infrequently returns EINVAL. Closer inspection shows the block
layer was trying to return ENOMEM (which is much more
understandable) but for some reason overroad that useful error.
Patch below does not show this (unchanged) line:
ret =__blk_rq_map_user_iov(rq, map_data, &i, gfp_mask, copy);
That 'ret' was being overridden when that function failed.
Signed-off-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Since I can remember DM has forced the block layer to allow the
allocation and initialization of the request_queue to be distinct
operations. Reason for this is block/genhd.c:add_disk() has requires
that the request_queue (and associated bdi) be tied to the gendisk
before add_disk() is called -- because add_disk() also deals with
exposing the request_queue via blk_register_queue().
DM's dynamic creation of arbitrary device types (and associated
request_queue types) requires the DM device's gendisk be available so
that DM table loads can establish a master/slave relationship with
subordinate devices that are referenced by loaded DM tables -- using
bd_link_disk_holder(). But until these DM tables, and their associated
subordinate devices, are known DM cannot know what type of request_queue
it needs -- nor what its queue_limits should be.
This chicken and egg scenario has created all manner of problems for DM
and, at times, the block layer.
Summary of changes:
- Add device_add_disk_no_queue_reg() and add_disk_no_queue_reg() variant
that drivers may use to add a disk without also calling
blk_register_queue(). Driver must call blk_register_queue() once its
request_queue is fully initialized.
- Return early from blk_unregister_queue() if QUEUE_FLAG_REGISTERED
is not set. It won't be set if driver used add_disk_no_queue_reg()
but driver encounters an error and must del_gendisk() before calling
blk_register_queue().
- Export blk_register_queue().
These changes allow DM to use add_disk_no_queue_reg() to anchor its
gendisk as the "master" for master/slave relationships DM must establish
with subordinate devices referenced in DM tables that get loaded. Once
all "slave" devices for a DM device are known its request_queue can be
properly initialized and then advertised via sysfs -- important
improvement being that no request_queue resource initialization
performed by blk_register_queue() is missed for DM devices anymore.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|