summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* Merge branch 'for-linus' into for-4.12/blockJens Axboe2017-04-07455-3005/+6229
|\ | | | | | | | | | | | | | | | | | | We've added a considerable amount of fixes for stalls and issues with the blk-mq scheduling in the 4.11 series since forking off the for-4.12/block branch. We need to do improvements on top of that for 4.12, so pull in the previous fixes to make our lives easier going forward. Signed-off-by: Jens Axboe <axboe@fb.com>
| * blk-mq: Restart a single queue if tag sets are sharedBart Van Assche2017-04-074-27/+55
| | | | | | | | | | | | | | | | | | | | | | | | | | | | To improve scalability, if hardware queues are shared, restart a single hardware queue in round-robin fashion. Rename blk_mq_sched_restart_queues() to reflect the new semantics. Remove blk_mq_sched_mark_restart_queue() because this function has no callers. Remove flag QUEUE_FLAG_RESTART because this patch removes the code that uses this flag. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * dm rq: Avoid that request processing stalls sporadicallyBart Van Assche2017-04-071-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While running the srp-test software I noticed that request processing stalls sporadically at the beginning of a test, namely when mkfs is run against a dm-mpath device. Every time when that happened the following command was sufficient to resume request processing: echo run >/sys/kernel/debug/block/dm-0/state This patch avoids that such request processing stalls occur. The test I ran is as follows: while srp-test/run_tests -d -r 30 -t 02-mq; do :; done Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Signed-off-by: Jens Axboe <axboe@fb.com>
| * scsi: Avoid that SCSI queues get stuckBart Van Assche2017-04-071-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a .queue_rq() function returns BLK_MQ_RQ_QUEUE_BUSY then the block driver that implements that function is responsible for rerunning the hardware queue once requests can be queued again successfully. commit 52d7f1b5c2f3 ("blk-mq: Avoid that requeueing starts stopped queues") removed the blk_mq_stop_hw_queue() call from scsi_queue_rq() for the BLK_MQ_RQ_QUEUE_BUSY case. Hence change all calls to functions that are intended to rerun a busy queue such that these examine all hardware queues instead of only stopped queues. Since no other functions than scsi_internal_device_block() and scsi_internal_device_unblock() should ever stop or restart a SCSI queue, change the blk_mq_delay_queue() call into a blk_mq_delay_run_hw_queue() call. Fixes: commit 52d7f1b5c2f3 ("blk-mq: Avoid that requeueing starts stopped queues") Fixes: commit 7e79dadce222 ("blk-mq: stop hardware queue in blk_mq_delay_queue()") Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Long Li <longli@microsoft.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * blk-mq: Introduce blk_mq_delay_run_hw_queue()Bart Van Assche2017-04-072-2/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce a function that runs a hardware queue unconditionally after a delay. Note: there is already a function that stops and restarts a hardware queue after a delay, namely blk_mq_delay_queue(). This function will be used in the next patch in this series. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Long Li <longli@microsoft.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * blk-mq: remap queues when adding/removing hardware queuesOmar Sandoval2017-04-071-4/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | blk_mq_update_nr_hw_queues() used to remap hardware queues, which is the behavior that drivers expect. However, commit 4e68a011428a changed blk_mq_queue_reinit() to not remap queues for the case of CPU hotplugging, inadvertently making blk_mq_update_nr_hw_queues() not remap queues as well. This breaks, for example, NBD's multi-connection mode, leaving the added hardware queues unused. Fix it by making blk_mq_update_nr_hw_queues() explicitly remap the queues. Fixes: 4e68a011428a ("blk-mq: don't redistribute hardware queues on a CPU hotplug event") Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * blk-mq-sched: fix crash in switch error pathOmar Sandoval2017-04-076-48/+67
| | | | | | | | | | | | | | | | | | | | | | In elevator_switch(), if blk_mq_init_sched() fails, we attempt to fall back to the original scheduler. However, at this point, we've already torn down the original scheduler's tags, so this causes a crash. Doing the fallback like the legacy elevator path is much harder for mq, so fix it by just falling back to none, instead. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * blk-mq-sched: set up scheduler tags when bringing up new queuesOmar Sandoval2017-04-073-1/+35
| | | | | | | | | | | | | | | | | | | | If a new hardware queue is added at runtime, we don't allocate scheduler tags for it, leading to a crash. This hooks up the scheduler framework to blk_mq_{init,exit}_hctx() to make sure everything gets properly initialized/freed. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * blk-mq-sched: refactor scheduler initializationOmar Sandoval2017-04-073-59/+57
| | | | | | | | | | | | | | | | Preparation cleanup for the next couple of fixes, push blk_mq_sched_setup() and e->ops.mq.init_sched() into a helper. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * blk-mq: use the right hctx when getting a driver tag failsOmar Sandoval2017-04-073-17/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While dispatching requests, if we fail to get a driver tag, we mark the hardware queue as waiting for a tag and put the requests on a hctx->dispatch list to be run later when a driver tag is freed. However, blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware queues if using a single-queue scheduler with a multiqueue device. If blk_mq_get_driver_tag() fails, it doesn't update the hardware queue we are processing. This means we end up using the hardware queue of the previous request, which may or may not be the same as that of the current request. If it isn't, the wrong hardware queue will end up waiting for a tag, and the requests will be on the wrong dispatch list, leading to a hang. The fix is twofold: 1. Make sure we save which hardware queue we were trying to get a request for in blk_mq_get_driver_tag() regardless of whether it succeeds or not. 2. Make blk_mq_dispatch_rq_list() take a request_queue instead of a blk_mq_hw_queue to make it clear that it must handle multiple hardware queues, since I've already messed this up on a couple of occasions. This didn't appear in testing with nvme and mq-deadline because nvme has more driver tags than the default number of scheduler tags. However, with the blk_mq_update_nr_hw_queues() fix, it showed up with nbd. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * Merge branch 'nvme-4.11-rc' of git://git.infradead.org/nvme into for-linusJens Axboe2017-04-044-12/+12
| |\ | | | | | | | | | | | | | | | | | | Sagi writes: We have one spec mis-match fix from Roland and several sparse fixes from Christoph.
| | * nvmet: fix byte swap in nvmet_parse_io_cmdChristoph Hellwig2017-04-021-1/+1
| | | | | | | | | | | | | | | | | | | | | We need to do arithmetics after byte swapping, not before. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
| | * nvmet: fix byte swap in nvmet_execute_write_zeroesChristoph Hellwig2017-04-021-1/+1
| | | | | | | | | | | | | | | | | | | | | The length field in the Write Zeroes command is a 16-bit field. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
| | * nvmet: add missing byte swap in nvmet_get_smart_logChristoph Hellwig2017-04-021-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | In this case entirely harmless as it's all-ones, but still nice to shut up sparse. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
| | * nvme: add missing byte swap in nvme_setup_discardChristoph Hellwig2017-04-021-1/+1
| | | | | | | | | | | | | | | | | | Fixes: b35ba01e ("nvme: support ranged discard requests") Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
| | * nvme: Correct NVMF enum values to match NVMe-oF rev 1.0Roland Dreier2017-04-021-8/+8
| |/ | | | | | | | | | | | | | | | | | | | | | | | | The enum values for QPTYPE, PRTYPE and CMS are off by 1 from the values defined in figure 42 of the NVM Express over Fabrics 1.0: http://www.nvmexpress.org/wp-content/uploads/NVMe_over_Fabrics_1_0_Gold_20160605-1.pdf Fix our enums to match the final spec. Signed-off-by: Roland Dreier <roland@purestorage.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
| * block: do not put mq context in blk_mq_alloc_request_hctxMinchan Kim2017-03-301-1/+0
| | | | | | | | | | | | | | | | | | | | | | In blk_mq_alloc_request_hctx, blk_mq_sched_get_request doesn't get sw context so we don't need to put the context with blk_mq_put_ctx. Unless, we will see preempt counter underflow. Cc: Omar Sandoval <osandov@fb.com> Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
| * Merge branch 'for-rc' of ↵Linus Torvalds2017-03-302-19/+34
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux Pull thermal management fixes from Zhang Rui: - Fix a potential deadlock in cpu_cooling driver, which was introduced in 4.11-rc1. (Matthew Wilcox) - Fix the cpu_cooling and devfreq_cooling code to handle possible error return value from OPP calls, together with three minor fixes in the same patch series. (Viresh Kumar) * 'for-rc' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux: thermal: cpu_cooling: Check OPP for errors thermal: cpu_cooling: Replace dev_warn with dev_err thermal: devfreq: Check OPP for errors thermal: devfreq_cooling: Replace dev_warn with dev_err thermal: devfreq: Simplify expression thermal: Fix potential deadlock in cpu_cooling
| | * thermal: cpu_cooling: Check OPP for errorsViresh Kumar2017-03-131-2/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is possible for dev_pm_opp_find_freq_exact() to return errors. It was all fine earlier as dev_pm_opp_get_voltage() had a check within it to check for invalid OPPs, but dev_pm_opp_put() doesn't have any similar checks and the callers need to make sure OPP is valid before calling them. Also update the later dev_warn_ratelimited() to not print the error message as the OPP is guaranteed to be valid now. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Zhang Rui <rui.zhang@intel.com>
| | * thermal: cpu_cooling: Replace dev_warn with dev_errViresh Kumar2017-03-131-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | There isn't much the user can do on seeing these warnings, as the hardware is actually okay. dev_err suits much better here. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Zhang Rui <rui.zhang@intel.com>
| | * thermal: devfreq: Check OPP for errorsViresh Kumar2017-03-131-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is possible for dev_pm_opp_find_freq_exact() to return errors. It was all fine earlier as dev_pm_opp_get_voltage() had a check within it to check for invalid OPPs, but dev_pm_opp_put() doesn't have any similar checks and the callers need to make sure OPP is valid before calling them. Also update the later dev_warn_ratelimited() to not print the error message as the OPP is guaranteed to be valid now. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Zhang Rui <rui.zhang@intel.com>
| | * thermal: devfreq_cooling: Replace dev_warn with dev_errViresh Kumar2017-03-131-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | There isn't much the user can do on seeing this warning, as the hardware is actually okay. dev_err suits much better here. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Zhang Rui <rui.zhang@intel.com>
| | * thermal: devfreq: Simplify expressionViresh Kumar2017-03-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | There is no need to check for IS_ERR() as we are looking for a very particular error value here. Drop the first check. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Zhang Rui <rui.zhang@intel.com>
| | * thermal: Fix potential deadlock in cpu_coolingMatthew Wilcox2017-03-131-9/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cooling_list_lock is covering not just cpufreq_dev_count, but also the calls to cpufreq_register_notifier() and cpufreq_unregister_notifier(). Since cooling_list_lock is also used within cpufreq_thermal_notifier(), lockdep reports a potential deadlock. Fix it by testing the condition under cooling_list_lock and dropping the lock before calling cpufreq_register_notifier(). And variable cpufreq_dev_count is removed at the same time, because it's no longer needed after the fix. Fixes: ae606089621e ("thermal: convert cpu_cooling to use an IDA") Reported-and-Tested-by: Russell King <linux@armlinux.org.uk> Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com Signed-off-by: Zhang Rui <rui.zhang@intel.com>
| * | Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds2017-03-292-36/+107
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block fixes from Jens Axboe: "Five fixes for this series: - a fix from me to ensure that blk-mq drivers that terminate IO in their ->queue_rq() handler by returning QUEUE_ERROR don't stall with a scheduler enabled. - four nbd fixes from Josef and Ratna, fixing various problems that are critical enough to go in for this cycle. They have been well tested" * 'for-linus' of git://git.kernel.dk/linux-block: nbd: replace kill_bdev() with __invalidate_device() nbd: set queue timeout properly nbd: set rq->errors to actual error code nbd: handle ERESTARTSYS properly blk-mq: include errors in did_work calculation
| | * | nbd: replace kill_bdev() with __invalidate_device()Ratna Manoj Bolla2017-03-241-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a filesystem is mounted on a nbd device and on a disconnect, because of kill_bdev(), and resetting bdev size to zero, buffer_head mappings are getting destroyed under mounted filesystem. After a bdev size reset(i.e bdev->bd_inode->i_size = 0) on a disconnect, followed by a sys_umount(), generic_shutdown_super()->... ->__sync_blockdev()->... -blkdev_writepages()->... ->do_invalidatepage()->... -discard_buffer() is discarding superblock buffer_head assumed to be in mapped state by ext4_commit_super(). [mlin: ported to 4.11-rc2] Signed-off-by: Ratna Manoj Bolla <manoj.br@gmail.com Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| | * | nbd: set queue timeout properlyJosef Bacik2017-03-241-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We can't just set the timeout on the tagset, we have to set it on the queue as it would have been setup already at this point. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| | * | nbd: set rq->errors to actual error codeJosef Bacik2017-03-241-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We've been relying on the block layer to assume rq->errors being set translates into -EIO. I noticed in testing that sometimes this isn't true, and really there's not much of a reason to have a counter instead of just using -EIO. So set it properly so we don't leak random numbers to unsuspecting victims. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| | * | nbd: handle ERESTARTSYS properlyJosef Bacik2017-03-241-26/+89
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We can submit IO in a processes context, which means there can be pending signals. This isn't a fatal error for NBD, but it does require some finesse. If the signal happens before we transmit anything then we are ok, just requeue the request and carry on. However if we've done a partial transmit we can't allow anything else to be transmitted on this socket until we transmit the remaining part of the request. Deal with this by keeping track of how much we've sent for the current request, and if we get an ERESTARTSYS during any part of our transmission save the state of that request and requeue the IO. If anybody tries to submit a request that isn't our pending request then requeue that request until we are able to service the one that is pending. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| | * | blk-mq: include errors in did_work calculationJens Axboe2017-03-241-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently we return true in blk_mq_dispatch_rq_list() if we queued IO successfully, but we really want to return whether or not the we made progress. Progress includes if we got an error return. If we don't, this can lead to a hang in blk_mq_sched_dispatch_requests() when a driver is draining IO by returning BLK_MQ_QUEUE_ERROR instead of manually ending the IO in error and return BLK_MQ_QUEUE_OK. Tested-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | Merge branch 'apw' (xfrm_user fixes)Linus Torvalds2017-03-291-1/+8
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge xfrm_user validation fixes from Andy Whitcroft: "Two patches we are applying to Ubuntu for XFRM_MSG_NEWAE validation issue reported by ZDI. The first of these is the primary fix, and the second is for a more theoretical issue that Kees pointed out when reviewing the first" * emailed patches from Andy Whitcroft <apw@canonical.com>: xfrm_user: validate XFRM_MSG_NEWAE incoming ESN size harder xfrm_user: validate XFRM_MSG_NEWAE XFRMA_REPLAY_ESN_VAL replay_window
| | * | | xfrm_user: validate XFRM_MSG_NEWAE incoming ESN size harderAndy Whitcroft2017-03-291-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Kees Cook has pointed out that xfrm_replay_state_esn_len() is subject to wrapping issues. To ensure we are correctly ensuring that the two ESN structures are the same size compare both the overall size as reported by xfrm_replay_state_esn_len() and the internal length are the same. CVE-2017-7184 Signed-off-by: Andy Whitcroft <apw@canonical.com> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| | * | | xfrm_user: validate XFRM_MSG_NEWAE XFRMA_REPLAY_ESN_VAL replay_windowAndy Whitcroft2017-03-291-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a new xfrm state is created during an XFRM_MSG_NEWSA call we validate the user supplied replay_esn to ensure that the size is valid and to ensure that the replay_window size is within the allocated buffer. However later it is possible to update this replay_esn via a XFRM_MSG_NEWAE call. There we again validate the size of the supplied buffer matches the existing state and if so inject the contents. We do not at this point check that the replay_window is within the allocated memory. This leads to out-of-bounds reads and writes triggered by netlink packets. This leads to memory corruption and the potential for priviledge escalation. We already attempt to validate the incoming replay information in xfrm_new_ae() via xfrm_replay_verify_len(). This confirms that the user is not trying to change the size of the replay state buffer which includes the replay_esn. It however does not check the replay_window remains within that buffer. Add validation of the contained replay_window. CVE-2017-7184 Signed-off-by: Andy Whitcroft <apw@canonical.com> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | Merge branch 'regset' (PTRACE_SETREGSET data leakage)Linus Torvalds2017-03-295-50/+23
| |\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge PTRACE_SETREGSET leakage fixes from Dave Martin: "This series is the collection of fixes I proposed on this topic, that have not yet appeared upstream or in the stable branches, The issue can leak kernel stack, but doesn't appear to allow userspace to attack the kernel directly. The affected architectures are c6x, h8300, metag, mips and sparc. [ Mark Salter points out that c6x has no MMU or other mechanism to prevent userspace access to kernel code or data on c6x, but it doesn't hurt to clean that case up too. ] The bugs arise from use of user_regset_copyin(). Users of user_regset_copyin() can work in one of two ways: 1) Copy directly to thread_struct or equivalent. (This seems to be the design assumption of the regset API, and is the most common approach.) 2) Copy to a local variable and then transfer to thread_struct. (A significant minority of cases.) Buggy code typically involves approach 2" * emailed patches from Dave Martin <Dave.Martin@arm.com>: sparc/ptrace: Preserve previous registers for short regset write mips/ptrace: Preserve previous registers for short regset write metag/ptrace: Reject partial NT_METAG_RPIPE writes metag/ptrace: Provide default TXSTATUS for short NT_PRSTATUS metag/ptrace: Preserve previous registers for short regset write h8300/ptrace: Fix incorrect register transfer count c6x/ptrace: Remove useless PTRACE_SETREGSET implementation
| | * | | | sparc/ptrace: Preserve previous registers for short regset writeDave Martin2017-03-291-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ensure that if userspace supplies insufficient data to PTRACE_SETREGSET to fill all the registers, the thread's old registers are preserved. Signed-off-by: Dave Martin <Dave.Martin@arm.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| | * | | | mips/ptrace: Preserve previous registers for short regset writeDave Martin2017-03-291-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ensure that if userspace supplies insufficient data to PTRACE_SETREGSET to fill all the registers, the thread's old registers are preserved. Signed-off-by: Dave Martin <Dave.Martin@arm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| | * | | | metag/ptrace: Reject partial NT_METAG_RPIPE writesDave Martin2017-03-291-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It's not clear what behaviour is sensible when doing partial write of NT_METAG_RPIPE, so just don't bother. This patch assumes that userspace will never rely on a partial SETREGSET in this case, since it's not clear what should happen anyway. Signed-off-by: Dave Martin <Dave.Martin@arm.com> Acked-by: James Hogan <james.hogan@imgtec.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| | * | | | metag/ptrace: Provide default TXSTATUS for short NT_PRSTATUSDave Martin2017-03-291-3/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ensure that if userspace supplies insufficient data to PTRACE_SETREGSET to fill TXSTATUS, a well-defined default value is used, based on the task's current value. Suggested-by: James Hogan <james.hogan@imgtec.com> Signed-off-by: Dave Martin <Dave.Martin@arm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| | * | | | metag/ptrace: Preserve previous registers for short regset writeDave Martin2017-03-291-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ensure that if userspace supplies insufficient data to PTRACE_SETREGSET to fill all the registers, the thread's old registers are preserved. Signed-off-by: Dave Martin <Dave.Martin@arm.com> Acked-by: James Hogan <james.hogan@imgtec.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| | * | | | h8300/ptrace: Fix incorrect register transfer countDave Martin2017-03-291-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | regs_set() and regs_get() are vulnerable to an off-by-1 buffer overrun if CONFIG_CPU_H8S is set, since this adds an extra entry to register_offset[] but not to user_regs_struct. So, iterate over user_regs_struct based on its actual size, not based on the length of register_offset[]. Signed-off-by: Dave Martin <Dave.Martin@arm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| | * | | | c6x/ptrace: Remove useless PTRACE_SETREGSET implementationDave Martin2017-03-291-41/+0
| | |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | gpr_set won't work correctly and can never have been tested, and the correct behaviour is not clear due to the endianness-dependent task layout. So, just remove it. The core code will now return -EOPNOTSUPPORT when trying to set NT_PRSTATUS on this architecture until/unless a correct implementation is supplied. Signed-off-by: Dave Martin <Dave.Martin@arm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds2017-03-282-10/+18
| |\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull virtio fixes from Michael Tsirkin: "Fixes to multiple issues in virtio. Most notably a regression fix for crashes reported by Fedora users. Hibernate is still reportedly broken, working on it" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: virtio_balloon: prevent uninitialized variable use virtio-balloon: use actual number of stats for stats queue buffers virtio_balloon: init 1st buffer in stats vq virtio_pci: fix out of bound access for msix_names
| | * | | | virtio_balloon: prevent uninitialized variable useArnd Bergmann2017-03-281-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The latest gcc-7.0.1 snapshot reports a new warning: virtio/virtio_balloon.c: In function 'update_balloon_stats': virtio/virtio_balloon.c:258:26: error: 'events[2]' is used uninitialized in this function [-Werror=uninitialized] virtio/virtio_balloon.c:260:26: error: 'events[3]' is used uninitialized in this function [-Werror=uninitialized] virtio/virtio_balloon.c:261:56: error: 'events[18]' is used uninitialized in this function [-Werror=uninitialized] virtio/virtio_balloon.c:262:56: error: 'events[17]' is used uninitialized in this function [-Werror=uninitialized] This seems absolutely right, so we should add an extra check to prevent copying uninitialized stack data into the statistics. >From all I can tell, this has been broken since the statistics code was originally added in 2.6.34. Fixes: 9564e138b1f6 ("virtio: Add memory statistics reporting to the balloon driver (V4)") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Ladi Prosek <lprosek@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
| | * | | | virtio-balloon: use actual number of stats for stats queue buffersLadi Prosek2017-03-281-7/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The virtio balloon driver contained a not-so-obvious invariant that update_balloon_stats has to update exactly VIRTIO_BALLOON_S_NR counters in order to send valid stats to the host. This commit fixes it by having update_balloon_stats return the actual number of counters, and its callers use it when pushing buffers to the stats virtqueue. Note that it is still out of spec to change the number of counters at run-time. "Driver MUST supply the same subset of statistics in all buffers submitted to the statsq." Suggested-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Ladi Prosek <lprosek@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
| | * | | | virtio_balloon: init 1st buffer in stats vqLadi Prosek2017-03-281-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When init_vqs runs, virtio_balloon.stats is either uninitialized or contains stale values. The host updates its state with garbage data because it has no way of knowing that this is just a marker buffer used for signaling. This patch updates the stats before pushing the initial buffer. Alternative fixes: * Push an empty buffer in init_vqs. Not easily done with the current virtio implementation and violates the spec "Driver MUST supply the same subset of statistics in all buffers submitted to the statsq". * Push a buffer with invalid tags in init_vqs. Violates the same spec clause, plus "invalid tag" is not really defined. Note: the spec says: When using the legacy interface, the device SHOULD ignore all values in the first buffer in the statsq supplied by the driver after device initialization. Note: Historically, drivers supplied an uninitialized buffer in the first buffer. Unfortunately QEMU does not seem to implement the recommendation even for the legacy interface. Cc: stable@vger.kernel.org Signed-off-by: Ladi Prosek <lprosek@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
| | * | | | virtio_pci: fix out of bound access for msix_namesJason Wang2017-03-281-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fedora has received multiple reports of crashes when running 4.11 as a guest https://bugzilla.redhat.com/show_bug.cgi?id=1430297 https://bugzilla.redhat.com/show_bug.cgi?id=1434462 https://bugzilla.kernel.org/show_bug.cgi?id=194911 https://bugzilla.redhat.com/show_bug.cgi?id=1433899 The crashes are not always consistent but they are generally some flavor of oops or GPF in virtio related code. Multiple people have done bisections (Thank you Thorsten Leemhuis and Richard W.M. Jones) and found this commit to be at fault 07ec51480b5eb1233f8c1b0f5d7a7c8d1247c507 is the first bad commit commit 07ec51480b5eb1233f8c1b0f5d7a7c8d1247c507 Author: Christoph Hellwig <hch@lst.de> Date: Sun Feb 5 18:15:19 2017 +0100 virtio_pci: use shared interrupts for virtqueues The issue seems to be an out of bounds access to the msix_names array corrupting kernel memory. Fixes: 07ec51480b5e ("virtio_pci: use shared interrupts for virtqueues") Reported-by: Laura Abbott <labbott@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Richard W.M. Jones <rjones@redhat.com> Tested-by: Thorsten Leemhuis <linux@leemhuis.info>
| * | | | | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2017-03-2812-35/+153
| |\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull KVM fixes from Paolo Bonzini: "All x86-specific, apart from some arch-independent syzkaller fixes" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: x86: cleanup the page tracking SRCU instance KVM: nVMX: fix nested EPT detection KVM: pci-assign: do not map smm memory slot pages in vt-d page tables KVM: kvm_io_bus_unregister_dev() should never fail KVM: VMX: Fix enable VPID conditions KVM: nVMX: Fix nested VPID vmx exec control KVM: x86: correct async page present tracepoint kvm: vmx: Flush TLB when the APIC-access address changes KVM: x86: use pic/ioapic destructor when destroy vm KVM: x86: check existance before destroy KVM: x86: clear bus pointer when destroyed KVM: Documentation: document MCE ioctls KVM: nVMX: don't reset kvm mmu twice PTP: fix ptr_ret.cocci warnings kvm: fix usage of uninit spinlock in avic_vm_destroy() KVM: VMX: downgrade warning on unexpected exit code
| | * | | | | KVM: x86: cleanup the page tracking SRCU instancePaolo Bonzini2017-03-283-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | SRCU uses a delayed work item. Skip cleaning it up, and the result is use-after-free in the work item callbacks. Reported-by: Dmitry Vyukov <dvyukov@google.com> Suggested-by: Dmitry Vyukov <dvyukov@google.com> Cc: stable@vger.kernel.org Fixes: 0eb05bf290cfe8610d9680b49abef37febd1c38a Reviewed-by: Xiao Guangrong <xiaoguangrong.eric@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| | * | | | | KVM: nVMX: fix nested EPT detectionLadi Prosek2017-03-281-4/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The nested_ept_enabled flag introduced in commit 7ca29de2136 was not computed correctly. We are interested only in L1's EPT state, not the the combined L0+L1 value. In particular, if L0 uses EPT but L1 does not, nested_ept_enabled must be false to make sure that PDPSTRs are loaded based on CR3 as usual, because the special case described in 26.3.2.4 Loading Page-Directory- Pointer-Table Entries does not apply. Fixes: 7ca29de21362 ("KVM: nVMX: fix CR3 load if L2 uses PAE paging and EPT") Cc: qemu-stable@nongnu.org Reported-by: Wanpeng Li <wanpeng.li@hotmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Ladi Prosek <lprosek@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| | * | | | | KVM: pci-assign: do not map smm memory slot pages in vt-d page tablesHerongguang (Stephen)2017-03-281-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | or VM memory are not put thus leaked in kvm_iommu_unmap_memslots() when destroy VM. This is consistent with current vfio implementation. Signed-off-by: herongguang <herongguang.he@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>