summaryrefslogtreecommitdiffstats
path: root/block/blk-mq.h (unfollow)
Commit message (Collapse)AuthorFilesLines
2018-10-05nvmet-rdma: use a private workqueue for deleteSagi Grimberg1-4/+15
Queue deletion is done asynchronous when the last reference on the queue is dropped. Thus, in order to make sure we don't over allocate under a connect/disconnect storm, we let queue deletion complete before making forward progress. However, given that we flush the system_wq from rdma_cm context which runs from a workqueue context, we can have a circular locking complaint [1]. Fix that by using a private workqueue for queue deletion. [1]: ====================================================== WARNING: possible circular locking dependency detected 4.19.0-rc4-dbg+ #3 Not tainted ------------------------------------------------------ kworker/5:0/39 is trying to acquire lock: 00000000a10b6db9 (&id_priv->handler_mutex){+.+.}, at: rdma_destroy_id+0x6f/0x440 [rdma_cm] but task is already holding lock: 00000000331b4e2c ((work_completion)(&queue->release_work)){+.+.}, at: process_one_work+0x3ed/0xa20 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 ((work_completion)(&queue->release_work)){+.+.}: process_one_work+0x474/0xa20 worker_thread+0x63/0x5a0 kthread+0x1cf/0x1f0 ret_from_fork+0x24/0x30 -> #2 ((wq_completion)"events"){+.+.}: flush_workqueue+0xf3/0x970 nvmet_rdma_cm_handler+0x133d/0x1734 [nvmet_rdma] cma_ib_req_handler+0x72f/0xf90 [rdma_cm] cm_process_work+0x2e/0x110 [ib_cm] cm_req_handler+0x135b/0x1c30 [ib_cm] cm_work_handler+0x2b7/0x38cd [ib_cm] process_one_work+0x4ae/0xa20 nvmet_rdma:nvmet_rdma_cm_handler: nvmet_rdma: disconnected (10): status 0 id 0000000040357082 worker_thread+0x63/0x5a0 kthread+0x1cf/0x1f0 ret_from_fork+0x24/0x30 nvme nvme0: Reconnecting in 10 seconds... -> #1 (&id_priv->handler_mutex/1){+.+.}: __mutex_lock+0xfe/0xbe0 mutex_lock_nested+0x1b/0x20 cma_ib_req_handler+0x6aa/0xf90 [rdma_cm] cm_process_work+0x2e/0x110 [ib_cm] cm_req_handler+0x135b/0x1c30 [ib_cm] cm_work_handler+0x2b7/0x38cd [ib_cm] process_one_work+0x4ae/0xa20 worker_thread+0x63/0x5a0 kthread+0x1cf/0x1f0 ret_from_fork+0x24/0x30 -> #0 (&id_priv->handler_mutex){+.+.}: lock_acquire+0xc5/0x200 __mutex_lock+0xfe/0xbe0 mutex_lock_nested+0x1b/0x20 rdma_destroy_id+0x6f/0x440 [rdma_cm] nvmet_rdma_release_queue_work+0x8e/0x1b0 [nvmet_rdma] process_one_work+0x4ae/0xa20 worker_thread+0x63/0x5a0 kthread+0x1cf/0x1f0 ret_from_fork+0x24/0x30 Fixes: 777dc82395de ("nvmet-rdma: occasionally flush ongoing controller teardown") Reported-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Tested-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-10-04block: Finish renaming REQ_DISCARD into REQ_OP_DISCARDBart Van Assche8-11/+11
Some time ago REQ_DISCARD was renamed into REQ_OP_DISCARD. Some comments and documentation files were not updated however. Update these comments and documentation files. See also commit 4e1b2d52a80d ("block, fs, drivers: remove REQ_OP compat defs and related code"). Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Mike Christie <mchristi@redhat.com> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Philipp Reisner <philipp.reisner@linbit.com> Cc: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-03cdrom: fix improper type cast, which can leat to information leak.Young_X1-1/+1
There is another cast from unsigned long to int which causes a bounds check to fail with specially crafted input. The value is then used as an index in the slot array in cdrom_slot_status(). This issue is similar to CVE-2018-16658 and CVE-2018-10940. Signed-off-by: Young_X <YangX92@hotmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-02pktcdvd: fix fall-through annotationGustavo A. R. Silva1-1/+1
Replace "fallthru" with a proper "fall through" annotation. This fix is part of the ongoing efforts to enabling -Wimplicit-fallthrough Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-01nvme: take node locality into account when selecting a pathChristoph Hellwig3-28/+54
Make current_path an array with an entry for every possible node, and cache the best path on a per-node basis. Take the node distance into account when selecting it. This is primarily useful for dual-ported PCIe devices which are connected to PCIe root ports on different sockets. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Hannes Reinecke <hare@suse.com>
2018-10-01nvmet: don't split large I/Os unconditionallySagi Grimberg2-2/+8
If we know that the I/O size exceeds our inline bio vec, no point using it and split the rest to begin with. We could in theory reuse the inline bio and only allocate the bio_vec, but its really not worth optimizing for. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-10-01nvme: call nvme_complete_rq when nvmf_check_ready fails for mpath I/OJames Smart3-2/+13
When an io is rejected by nvmf_check_ready() due to validation of the controller state, the nvmf_fail_nonready_command() will normally return BLK_STS_RESOURCE to requeue and retry. However, if the controller is dying or the I/O is marked for NVMe multipath, the I/O is failed so that the controller can terminate or so that the io can be issued on a different path. Unfortunately, as this reject point is before the transport has accepted the command, blk-mq ends up completing the I/O and never calls nvme_complete_rq(), which is where multipath may preserve or re-route the I/O. The end result is, the device user ends up seeing an EIO error. Example: single path connectivity, controller is under load, and a reset is induced. An I/O is received: a) while the reset state has been set but the queues have yet to be stopped; or b) after queues are started (at end of reset) but before the reconnect has completed. The I/O finishes with an EIO status. This patch makes the following changes: - Adds the HOST_PATH_ERROR pathing status from TP4028 - Modifies the reject point such that it appears to queue successfully, but actually completes the io with the new pathing status and calls nvme_complete_rq(). - nvme_complete_rq() recognizes the new status, avoids resetting the controller (likely was already done in order to get this new status), and calls the multipather to clear the current path that errored. This allows the next command (retry or new command) to select a new path if there is one. Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-10-01nvme-core: add async event trace helperChaitanya Kulkarni2-2/+37
This patch adds a new event for nvme async event notification. We print the async event in the decoded format when we recognize the event otherwise we just dump the result. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-10-01nvme_fc: add 'nvme_discovery' sysfs attribute to fc transport deviceJames Smart1-9/+95
The fc transport device should allow for a rediscovery, as userspace might have lost the events. Example is udev events not handled during system startup. This patch add a sysfs entry 'nvme_discovery' on the fc class to have it replay all udev discovery events for all local port/remote port address pairs. Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-10-01nvmet_fc: support target port removal with nvmet layerJames Smart1-8/+120
Currently, if a targetport has been connected to via the nvmet config (in other words, the add_port() transport routine called, and the nvmet port pointer stored for using in upcalls on new io), and if the targetport is then removed (say the lldd driver decides to unload or fully reset its hardware) and then re-added (the lldd driver reloads or reinits its hardware), the port pointer has been lost so there's no way to continue to post commands up to nvmet via the transport port. Correct by allocating a small "port context" structure that will be linked to by the targetport. The context will save the targetport WWN's and the nvmet port pointer to use for it. Initial allocation will occur when the targetport is bound to via add_port. The context will be deallocated when remove_port() is called. If a targetport is removed while nvmet has the active port context, the targetport will be unlinked from the port context before removal. If a new targetport is registered, the port contexts without a binding are looked through and if the WWN's match (so it's the same as nvmet's port context) the port context is linked to the new target port. Thus new io can be received on the new targetport and operation resumes with nvmet. Additionally, this also resolves nvmet configuration changing out from underneath of the nvme-fc target port (for example: a nvmetcli clear). Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-10-01nvme-fc: fix for a minor typosMilan P. Gandhi2-3/+3
Signed-off-by: Milan P. Gandhi <mgandhi@redhat.com> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-10-01nvmet: remove redundant module prefixChaitanya Kulkarni1-1/+1
This patch removes the redundant module prefix used in the pr_err() when nvmet_get_smart_log_nsid() failed to find the namespace provided as a part of smart-log command. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-10-01nvme: fix typo in nvme_identify_ns_descsMilan P. Gandhi1-1/+1
Signed-off-by: Milan P. Gandhi <mgandhi@redhat.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-09-30Linux 4.19-rc6v4.19-rc6Greg Kroah-Hartman1-1/+1
2018-09-30MAINTAINERS: fix reference to moved drivers/{misc => auxdisplay}/panel.cMiguel Ojeda1-1/+1
Commit 51c1e9b554c9 ("auxdisplay: Move panel.c to drivers/auxdisplay folder") moved the file, but the MAINTAINERS reference was not updated. Link: https://lore.kernel.org/lkml/20180928220131.31075-1-joe@perches.com/ Reported-by: Joe Perches <joe@perches.com> Signed-off-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
2018-09-29cpufreq: qcom-kryo: Fix section annotationsNathan Chancellor1-2/+2
There is currently a warning when building the Kryo cpufreq driver into the kernel image: WARNING: vmlinux.o(.text+0x8aa424): Section mismatch in reference from the function qcom_cpufreq_kryo_probe() to the function .init.text:qcom_cpufreq_kryo_get_msm_id() The function qcom_cpufreq_kryo_probe() references the function __init qcom_cpufreq_kryo_get_msm_id(). This is often because qcom_cpufreq_kryo_probe lacks a __init annotation or the annotation of qcom_cpufreq_kryo_get_msm_id is wrong. Remove the '__init' annotation from qcom_cpufreq_kryo_get_msm_id so that there is no more mismatch warning. Additionally, Nick noticed that the remove function was marked as '__init' when it should really be marked as '__exit'. Fixes: 46e2856b8e18 (cpufreq: Add Kryo CPU scaling driver) Fixes: 5ad7346b4ae2 (cpufreq: kryo: Add module remove and exit) Reported-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: 4.18+ <stable@vger.kernel.org> # 4.18+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-09-28perf/core: Add sanity check to deal with pinned event failureReinette Chatre1-0/+6
It is possible that a failure can occur during the scheduling of a pinned event. The initial portion of perf_event_read_local() contains the various error checks an event should pass before it can be considered valid. Ensure that the potential scheduling failure of a pinned event is checked for and have a credible error. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: fenghua.yu@intel.com Cc: tony.luck@intel.com Cc: acme@kernel.org Cc: gavin.hindman@intel.com Cc: jithu.joseph@intel.com Cc: dave.hansen@intel.com Cc: hpa@zytor.com Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/6486385d1f30336e9973b24c8c65f5079543d3d3.1537377064.git.reinette.chatre@intel.com
2018-09-28blk-iolatency: keep track of previous windows statsJosef Bacik1-7/+13
We apply a smoothing to the scale changes in order to keep sawtoothy behavior from occurring. However our window for checking if we've missed our target can sometimes be lower than the smoothing interval (500ms), especially on faster drives like ssd's. In order to deal with this keep track of the running tally of the previous intervals that we threw away because we had already done a scale event recently. This is needed for the ssd case as these low latency drives will have bursts of latency, and if it happens to be ok for the window that directly follows the opening of the scale window we could unthrottle when previous windows we were missing our target. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-28blk-iolatency: use a percentile approache for ssd'sJosef Bacik1-34/+145
We use an average latency approach for determining if we're missing our latency target. This works well for rotational storage where we have generally consistent latencies, but for ssd's and other low latency devices you have more of a spikey behavior, which means we often won't throttle misbehaving groups because a lot of IO completes at drastically faster times than our latency target. Instead keep track of how many IO's miss our target and how many IO's are done in our time window. If the p(90) latency is above our target then we know we need to throttle. With this change in place we are seeing the same throttling behavior with our testcase on ssd's as we see with rotational drives. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-28blk-iolatency: deal with small samplesJosef Bacik1-1/+1
There is logic to keep cgroups that haven't done a lot of IO in the most recent scale window from being punished for over-active higher priority groups. However for things like ssd's where the windows are pretty short we'll end up with small numbers of samples, so 5% of samples will come out to 0 if there aren't enough. Make the floor 1 sample to keep us from improperly bailing out of scaling down. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-28blk-iolatency: deal with nr_requests == 1Josef Bacik1-1/+1
Hitting the case where blk_queue_depth() returned 1 uncovered the fact that iolatency doesn't actually handle this case properly, it simply doesn't scale down anybody. For this case we should go straight into applying the time delay, which we weren't doing. Since we already limit the floor at 1 request this if statement is not needed, and this allows us to set our depth to 1 which allows us to apply the delay if needed. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-28blk-iolatency: use q->nr_requests directlyJosef Bacik1-3/+3
We were using blk_queue_depth() assuming that it would return nr_requests, but we hit a case in production on drives that had to have NCQ turned off in order for them to not shit the bed which resulted in a qd of 1, even though the nr_requests was much larger. iolatency really only cares about requests we are allowed to queue up, as any io that get's onto the request list is going to be serviced soonish, so we want to be throttling before the bio gets onto the request list. To make iolatency work as expected, simply use q->nr_requests instead of blk_queue_depth() as that is what we actually care about. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-28kyber: fix integer overflow of latency targets on 32-bitOmar Sandoval1-3/+3
NSEC_PER_SEC has type long, so 5 * NSEC_PER_SEC is calculated as a long. However, 5 seconds is 5,000,000,000 nanoseconds, which overflows a 32-bit long. Make sure all of the targets are calculated as 64-bit values. Fixes: 6e25cb01ea20 ("kyber: implement improved heuristics") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-28xen/blkfront: correct purging of persistent grantsJuergen Gross1-2/+2
Commit a46b53672b2c2e3770b38a4abf90d16364d2584b ("xen/blkfront: cleanup stale persistent grants") introduced a regression as purged persistent grants were not pu into the list of free grants again. Correct that. Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-28Revert "xen/blkfront: When purging persistent grants, keep them in the buffer"Jens Axboe1-1/+3
Fix didn't work for all cases, reverting to add a (hopefully) better fix. This reverts commit f151ba989d149bbdfc90e5405724bbea094f9b17. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-28virtio-blk: modernize sysfs attribute creationHannes Reinecke1-29/+39
Use new-style DEVICE_ATTR_RO/DEVICE_ATTR_RW to create the sysfs attributes and register the disk with default sysfs attribute groups. Signed-off-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-28zram: register default groups with device_add_disk()Hannes Reinecke1-21/+7
Register default sysfs groups during device_add_disk() to avoid a race condition with udev during startup. Signed-off-by: Hannes Reinecke <hare@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-28aoe: register default groups with device_add_disk()Hannes Reinecke3-16/+7
Register default sysfs groups during device_add_disk() to avoid a race condition with udev during startup. Signed-off-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Ed L. Cachin <ed.cashin@acm.org> Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-28nvme: register ns_id attributes as default sysfs groupsHannes Reinecke4-92/+59
We should be registering the ns_id attribute as default sysfs attribute groups, otherwise we have a race condition between the uevent and the attributes appearing in sysfs. Suggested-by: Bart van Assche <bvanassche@acm.org> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-28block: genhd: add 'groups' argument to device_add_diskHannes Reinecke28-33/+43
Update device_add_disk() to take an 'groups' argument so that individual drivers can register a device with additional sysfs attributes. This avoids race condition the driver would otherwise have if these groups were to be created with sysfs_add_groups(). Signed-off-by: Martin Wilck <martin.wilck@suse.com> Signed-off-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-28selftests/powerpc: Fix Makefiles for headers_install changeMichael Ellerman17-0/+17
Commit b2d35fa5fc80 ("selftests: add headers_install to lib.mk") introduced a requirement that Makefiles more than one level below the selftests directory need to define top_srcdir, but it didn't update any of the powerpc Makefiles. This broke building all the powerpc selftests with eg: make[1]: Entering directory '/src/linux/tools/testing/selftests/powerpc' BUILD_TARGET=/src/linux/tools/testing/selftests/powerpc/alignment; mkdir -p $BUILD_TARGET; make OUTPUT=$BUILD_TARGET -k -C alignment all make[2]: Entering directory '/src/linux/tools/testing/selftests/powerpc/alignment' ../../lib.mk:20: ../../../../scripts/subarch.include: No such file or directory make[2]: *** No rule to make target '../../../../scripts/subarch.include'. make[2]: Failed to remake makefile '../../../../scripts/subarch.include'. Makefile:38: recipe for target 'alignment' failed Fix it by setting top_srcdir in the affected Makefiles. Fixes: b2d35fa5fc80 ("selftests: add headers_install to lib.mk") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-09-28kyber: add tracepointsOmar Sandoval2-18/+130
When debugging Kyber, it's really useful to know what latencies we've been having, how the domain depths have been adjusted, and if we've actually been throttling. Add three tracepoints, kyber_latency, kyber_adjust, and kyber_throttled, to record that. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-28kyber: implement improved heuristicsOmar Sandoval1-218/+279
Kyber's current heuristics have a few flaws: - It's based on the mean latency, but p99 latency tends to be more meaningful to anyone who cares about latency. The mean can also be skewed by rare outliers that the scheduler can't do anything about. - The statistics calculations are purely time-based with a short window. This works for steady, high load, but is more sensitive to outliers with bursty workloads. - It only considers the latency once an I/O has been submitted to the device, but the user cares about the time spent in the kernel, as well. These are shortcomings of the generic blk-stat code which doesn't quite fit the ideal use case for Kyber. So, this replaces the statistics with a histogram used to calculate percentiles of total latency and I/O latency, which we then use to adjust depths in a slightly more intelligent manner: - Sync and async writes are now the same domain. - Discards are a separate domain. - Domain queue depths are scaled by the ratio of the p99 total latency to the target latency (e.g., if the p99 latency is double the target latency, we will double the queue depth; if the p99 latency is half of the target latency, we can halve the queue depth). - We use the I/O latency to determine whether we should scale queue depths down: we will only scale down if any domain's I/O latency exceeds the target latency, which is an indicator of congestion in the device. These new heuristics are just as scalable as the heuristics they replace. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-28kyber: don't make domain token sbitmap larger than necessaryOmar Sandoval1-13/+2
The domain token sbitmaps are currently initialized to the device queue depth or 256, whichever is larger, and immediately resized to the maximum depth for that domain (256, 128, or 64 for read, write, and other, respectively). The sbitmap is never resized larger than that, so it's unnecessary to allocate a bitmap larger than the maximum depth. Let's just allocate it to the maximum depth to begin with. This will use marginally less memory, and more importantly, give us a more appropriate number of bits per sbitmap word. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-28block: export blk_stat_enable_accounting()Omar Sandoval1-0/+1
Kyber will need this in a future change if it is built as a module. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-28block: move call of scheduler's ->completed_request() hookOmar Sandoval4-8/+8
Commit 4bc6339a583c ("block: move blk_stat_add() to __blk_mq_end_request()") consolidated some calls using ktime_get() so we'd only need to call it once. Kyber's ->completed_request() hook also calls ktime_get(), so let's move it to the same place, too. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-27blk-mq: I/O and timer unplugs are inverted in blktraceIlya Dryomov1-2/+2
trace_block_unplug() takes true for explicit unplugs and false for implicit unplugs. schedule() unplugs are implicit and should be reported as timer unplugs. While correct in the legacy code, this has been inverted in blk-mq since 4.11. Cc: stable@vger.kernel.org Fixes: bd166ef183c2 ("blk-mq-sched: add framework for MQ capable IO schedulers") Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-27dax: Fix deadlock in dax_lock_mapping_entry()Jan Kara1-0/+1
When dax_lock_mapping_entry() has to sleep to obtain entry lock, it will fail to unlock mapping->i_pages spinlock and thus immediately deadlock against itself when retrying to grab the entry lock again. Fix the problem by unlocking mapping->i_pages before retrying. Fixes: c2a7d2a11552 ("filesystem-dax: Introduce dax_lock_mapping_entry()") Reported-by: Barret Rhoden <brho@google.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-09-27x86/boot: Fix kexec booting failure in the SEV bit detection codeKairui Song1-19/+0
Commit 1958b5fc4010 ("x86/boot: Add early boot support when running with SEV active") can occasionally cause system resets when kexec-ing a second kernel even if SEV is not active. That's because get_sev_encryption_bit() uses 32-bit rIP-relative addressing to read the value of enc_bit - a variable which caches a previously detected encryption bit position - but kexec may allocate the early boot code to a higher location, beyond the 32-bit addressing limit. In this case, garbage will be read and get_sev_encryption_bit() will return the wrong value, leading to accessing memory with the wrong encryption setting. Therefore, remove enc_bit, and thus get rid of the need to do 32-bit rIP-relative addressing in the first place. [ bp: massage commit message heavily. ] Fixes: 1958b5fc4010 ("x86/boot: Add early boot support when running with SEV active") Suggested-by: Borislav Petkov <bp@suse.de> Signed-off-by: Kairui Song <kasong@redhat.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Cc: linux-kernel@vger.kernel.org Cc: tglx@linutronix.de Cc: mingo@redhat.com Cc: hpa@zytor.com Cc: brijesh.singh@amd.com Cc: kexec@lists.infradead.org Cc: dyoung@redhat.com Cc: bhe@redhat.com Cc: ghook@redhat.com Link: https://lkml.kernel.org/r/20180927123845.32052-1-kasong@redhat.com
2018-09-27bcache: add separate workqueue for journal_write to avoid deadlockGuoju Fang3-3/+12
After write SSD completed, bcache schedules journal_write work to system_wq, which is a public workqueue in system, without WQ_MEM_RECLAIM flag. system_wq is also a bound wq, and there may be no idle kworker on current processor. Creating a new kworker may unfortunately need to reclaim memory first, by shrinking cache and slab used by vfs, which depends on bcache device. That's a deadlock. This patch create a new workqueue for journal_write with WQ_MEM_RECLAIM flag. It's rescuer thread will work to avoid the deadlock. Signed-off-by: Guoju Fang <fangguoju@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-27drm/amd/display: Fix Edid emulation for linuxBhawanpreet Lakha3-7/+137
[Why] EDID emulation didn't work properly for linux, as we stop programming if nothing is connected physically. [How] We get a flag from DRM when we want to do edid emulation. We check if this flag is true and nothing is connected physically, if so we only program the front end using VIRTUAL_SIGNAL. Signed-off-by: Bhawanpreet Lakha <Bhawanpreet.Lakha@amd.com> Reviewed-by: Harry Wentland <Harry.Wentland@amd.com> Acked-by: Leo Li <sunpeng.li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2018-09-27drm/amd/display: Fix Vega10 lightup on S3 resumeRoman Li3-18/+1
[Why] There have been a few reports of Vega10 display remaining blank after S3 resume. The regression is caused by workaround for mode change on Vega10 - skip set_bandwidth if stream count is 0. As a result we skipped dispclk reset on suspend, thus on resume we may skip the clock update assuming it hasn't been changed. On some systems it causes display blank or 'out of range'. [How] Revert "drm/amd/display: Fix Vega10 black screen after mode change" Verified that it hadn't cause mode change regression. Signed-off-by: Roman Li <Roman.Li@amd.com> Reviewed-by: Sun peng Li <Sunpeng.Li@amd.com> Acked-by: Leo Li <sunpeng.li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2018-09-27drm/amdgpu: Fix vce work queue was not cancelled when suspendRex Zhu2-3/+4
The vce cancel_delayed_work_sync never be called. driver call the function in error path. This caused the A+A suspend hang when runtime pm enebled. As we will visit the smu in the idle queue. this will cause smu hang because the dgpu has been suspend, and the dgpu also will be waked up. As the smu has been hang, so the dgpu resume will failed. Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Feifei Xu <Feifei.Xu@amd.com> Signed-off-by: Rex Zhu <Rex.Zhu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2018-09-27Revert "drm/panel: Add device_link from panel device to DRM device"Linus Walleij2-11/+0
This reverts commit 0c08754b59da5557532d946599854e6df28edc22. commit 0c08754b59da ("drm/panel: Add device_link from panel device to DRM device") creates a circular dependency under these circumstances: 1. The panel depends on dsi-host because it is MIPI-DSI child device. 2. dsi-host depends on the drm parent device (connector->dev->dev) this should be allowed. 3. drm parent dev (connector->dev->dev) depends on the panel after this patch. This makes the dependency circular and while it appears it does not affect any in-tree drivers (they do not seem to have dsi hosts depending on the same parent device) this does not seem right. As noted in a response from Andrzej Hajda, the intent is likely to make the panel dependent on the DRM device (connector->dev) not its parent. But we have no way of doing that since the DRM device doesn't contain any struct device on its own (arguably it should). Revert this until a proper approach is figured out. Cc: Jyri Sarha <jsarha@ti.com> Cc: Eric Anholt <eric@anholt.net> Cc: Andrzej Hajda <a.hajda@samsung.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Sean Paul <seanpaul@chromium.org> Link: https://patchwork.freedesktop.org/patch/msgid/20180927124130.9102-1-linus.walleij@linaro.org
2018-09-27xen/blkfront: When purging persistent grants, keep them in the bufferBoris Ostrovsky1-3/+1
Commit a46b53672b2c ("xen/blkfront: cleanup stale persistent grants") added support for purging persistent grants when they are not in use. As part of the purge, the grants were removed from the grant buffer, This eventually causes the buffer to become empty, with BUG_ON triggered in get_free_grant(). This can be observed even on an idle system, within 20-30 minutes. We should keep the grants in the buffer when purging, and only free the grant ref. Fixes: a46b53672b2c ("xen/blkfront: cleanup stale persistent grants") Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-27clocksource/drivers/timer-atmel-pit: Properly handle error casesAlexandre Belloni1-6/+14
The smatch utility reports a possible leak: smatch warnings: drivers/clocksource/timer-atmel-pit.c:183 at91sam926x_pit_dt_init() warn: possible memory leak of 'data' Ensure data is freed before exiting with an error. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Cc: stable@vger.kernel.org Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
2018-09-27block: fix deadline elevator drain for zoned block devicesDamien Le Moal1-1/+1
When the deadline scheduler is used with a zoned block device, writes to a zone will be dispatched one at a time. This causes the warning message: deadline: forced dispatching is broken (nr_sorted=X), please report this to be displayed when switching to another elevator with the legacy I/O path while write requests to a zone are being retained in the scheduler queue. Prevent this message from being displayed when executing elv_drain_elevator() for a zoned block device. __blk_drain_queue() will loop until all writes are dispatched and completed, resulting in the desired elevator queue drain without extensive modifications to the deadline code itself to handle forced-dispatch calls. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Fixes: 8dc8146f9c92 ("deadline-iosched: Introduce zone locking support") Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-26blk-mq: Enable support for runtime power managementBart Van Assche2-6/+2
Now that the blk-mq core processes power management requests (marked with RQF_PREEMPT) in other states than RPM_ACTIVE, enable runtime power management for blk-mq. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Jianchao Wang <jianchao.w.wang@oracle.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Cc: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-26block: Make blk_get_request() block for non-PM requests while suspendedBart Van Assche2-34/+47
Instead of allowing requests that are not power management requests to enter the queue in runtime suspended status (RPM_SUSPENDED), make the blk_get_request() caller block. This change fixes a starvation issue: it is now guaranteed that power management requests will be executed no matter how many blk_get_request() callers are waiting. For blk-mq, instead of maintaining the q->nr_pending counter, rely on q->q_usage_counter. Call pm_runtime_mark_last_busy() every time a request finishes instead of only if the queue depth drops to zero. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Jianchao Wang <jianchao.w.wang@oracle.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Cc: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-26block: Allow unfreezing of a queue while requests are in progressBart Van Assche1-1/+1
A later patch will call blk_freeze_queue_start() followed by blk_mq_unfreeze_queue() without waiting for q_usage_counter to drop to zero. Make sure that this doesn't cause a kernel warning to appear by switching from percpu_ref_reinit() to percpu_ref_resurrect(). The former namely requires that the refcount it operates on is zero. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Jianchao Wang <jianchao.w.wang@oracle.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>