summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds2017-02-2432-761/+986
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block updates and fixes from Jens Axboe: - NVMe updates and fixes that missed the first pull request. This includes bug fixes, and support for autonomous power management. - Fix from Christoph for missing clear of the request payload, causing a problem with (at least) the storvsc driver. - Further fixes for the queue/bdi life time issues from Jan. - The Kconfig mq scheduler update from me. - Fixing a use-after-free in dm-rq, spotted by Bart, introduced in this merge window. - Three fixes for nbd from Josef. - Bug fix from Omar, fixing a bug in sas transport code that oopses when bsg ioctls were used. From Omar. - Improvements to the queue restart and tag wait from from Omar. - Set of fixes for the sed/opal code from Scott. - Three trivial patches to cciss from Tobin * 'for-linus' of git://git.kernel.dk/linux-block: (41 commits) dm-rq: don't dereference request payload after ending request blk-mq-sched: separate mark hctx and queue restart operations blk-mq: use sbq wait queues instead of restart for driver tags block/sed-opal: Propagate original error message to userland. nvme/pci: re-check security protocol support after reset block/sed-opal: Introduce free_opal_dev to free the structure and clean up state nvme: detect NVMe controller in recent MacBooks nvme-rdma: add support for host_traddr nvmet-rdma: Fix error handling nvmet-rdma: use nvme cm status helper nvme-rdma: move nvme cm status helper to .h file nvme-fc: don't bother to validate ioccsz and iorcsz nvme/pci: No special case for queue busy on IO nvme/core: Fix race kicking freed request_queue nvme/pci: Disable on removal when disconnected nvme: Enable autonomous power state transitions nvme: Add a quirk mechanism that uses identify_ctrl nvme: make nvmf_register_transport require a create_ctrl callback nvme: Use CNS as 8-bit field and avoid endianness conversion nvme: add semicolon in nvme_command setting ...
| * dm-rq: don't dereference request payload after ending requestJens Axboe2017-02-241-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | Bart reported a case where dm would crash with use-after-free poison. This is due to dm_softirq_done() accessing memory associated with a request after calling end_request on it. This is most visible on !blk-mq, since we free the memory immediately for that case. Reported-by: Bart Van Assche <bart.vanassche@sandisk.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Fixes: eb8db831be80 ("dm: always defer request allocation to the owner of the request_queue") Signed-off-by: Jens Axboe <axboe@fb.com>
| * blk-mq-sched: separate mark hctx and queue restart operationsOmar Sandoval2017-02-232-20/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In blk_mq_sched_dispatch_requests(), we call blk_mq_sched_mark_restart() after we dispatch requests left over on our hardware queue dispatch list. This is so we'll go back and dispatch requests from the scheduler. In this case, it's only necessary to restart the hardware queue that we are running; there's no reason to run other hardware queues just because we are using shared tags. So, split out blk_mq_sched_mark_restart() into two operations, one for just the hardware queue and one for the whole request queue. The core code only needs the hctx variant, but I/O schedulers will want to use both. This also requires adjusting blk_mq_sched_restart_queues() to always check the queue restart flag, not just when using shared tags. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * blk-mq: use sbq wait queues instead of restart for driver tagsOmar Sandoval2017-02-232-9/+57
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 50e1dab86aa2 ("blk-mq-sched: fix starvation for multiple hardware queues and shared tags") fixed one starvation issue for shared tags. However, we can still get into a situation where we fail to allocate a tag because all tags are allocated but we don't have any pending requests on any hardware queue. One solution for this would be to restart all queues that share a tag map, but that really sucks. Ideally, we could just block and wait for a tag, but that isn't always possible from blk_mq_dispatch_rq_list(). However, we can still use the struct sbitmap_queue wait queues with a custom callback instead of blocking. This has a few benefits: 1. It avoids iterating over all hardware queues when completing an I/O, which the current restart code has to do. 2. It benefits from the existing rolling wakeup code. 3. It avoids punting to another thread just to have it block. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * block/sed-opal: Propagate original error message to userland.Scott Bauer2017-02-231-2/+5
| | | | | | | | | | | | | | | | | | | | During an error on a comannd, ex: user provides wrong pw to unlock range, we will gracefully terminate the opal session. We want to propagate the original error to userland instead of the result of the session termination, which is almost always a success. Signed-off-by: Scott Bauer <scott.bauer@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * nvme/pci: re-check security protocol support after resetScott Bauer2017-02-231-7/+10
| | | | | | | | | | | | | | | | | | | | | | A device may change capabilities after each reset, e.g. due to a firmware upgrade. We should thus check for Security Send/Receive and OPAL support after each reset. Based on patches from Christoph and Keith. Signed-off-by: Scott Bauer <scott.bauer@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * block/sed-opal: Introduce free_opal_dev to free the structure and clean up stateScott Bauer2017-02-232-0/+35
| | | | | | | | | | | | | | | | | | Before we free the opal structure we need to clean up any saved locking ranges that the user had told us to unlock from a suspend. Signed-off-by: Scott Bauer <scott.bauer@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
| * nvme: detect NVMe controller in recent MacBooksDaniel Roschka2017-02-221-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | Adds support for detection of the NVMe controller found in the following recent MacBooks: - Retina MacBook 2016 (MacBook9,1) - 13" MacBook Pro 2016 without Touch Bar (MacBook13,1) - 13" MacBook Pro 2016 with Touch Bar (MacBook13,2) Signed-off-by: Daniel Roschka <danielroschka@phoenitydawn.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
| * nvme-rdma: add support for host_traddrMax Gurtovoy2017-02-221-2/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This will enable the user to control the specific interface for connection establishment in case the host has more than 1 interface under the same subnet. E.g: Host interfaces configured as: - ib0 1.1.1.1/16 - ib1 1.1.1.2/16 Target interfaces configured as: - ib0 1.1.1.3/16 (listener interface) - ib1 1.1.1.4/16 the following connect command will go through host iface ib0 (default): nvme connect -t rdma -n testsubsystem -a 1.1.1.3 -s 1023 but the following command will go through host iface ib1: nvme connect -t rdma -n testsubsystem -a 1.1.1.3 -s 1023 -w 1.1.1.2 Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
| * nvmet-rdma: Fix error handlingChristophe JAILLET2017-02-221-1/+1
| | | | | | | | | | | | | | | | | | | | According to the preceeding goto, it is likely that 'out_destroy_sq' was expected here. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
| * nvmet-rdma: use nvme cm status helperMax Gurtovoy2017-02-221-2/+3
| | | | | | | | | | | | | | | | | | | | Also remove redundant debug prints. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
| * nvme-rdma: move nvme cm status helper to .h fileMax Gurtovoy2017-02-222-22/+24
| | | | | | | | | | | | | | | | | | | | | | This will enable the usage for nvme rdma target. Also move from a lookup array to a switch statement. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
| * nvme-fc: don't bother to validate ioccsz and iorcszJames Smart2017-02-221-12/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Discovery controllers don't set the values. They are in reserved areas of the Identify Controller data structure. Given the cmd completed, the minimal capsule sizes are supported, so no need to check nqn to detect discovery controllers and special case validations. Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
| * nvme/pci: No special case for queue busy on IOKeith Busch2017-02-221-4/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This driver previously required we have a special check for IO submitted to nvme IO queues that are temporarily suspended. That is no longer necessary since blk-mq provides a quiesce, so any IO that actually gets submitted to such a queue must be ended since the queue isn't going to start back up. This is fixing a condition where we have fewer IO queues after a controller reset. This may happen if the number of CPU's has changed, or controller firmware update changed the queue count, for example. While it may be possible to complete the IO on a different queue, the block layer does not provide a way to resubmit a request on a different hardware context once the request has entered the queue. We don't want these requests to be stuck indefinitely either, so ending them in error is our only option at the moment. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
| * nvme/core: Fix race kicking freed request_queueKeith Busch2017-02-221-3/+3
| | | | | | | | | | | | | | | | | | | | | | If a namespace has already been marked dead, we don't want to kick the request_queue again since we may have just freed it from another thread. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
| * nvme/pci: Disable on removal when disconnectedKeith Busch2017-02-221-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | If the device is not present, the driver should disable the queues immediately. Prior to this, the driver was relying on the watchdog timer to kill the queues if requests were outstanding to the device, and that just delays removal up to one second. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
| * nvme: Enable autonomous power state transitionsAndy Lutomirski2017-02-223-0/+171
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | NVMe devices can advertise multiple power states. These states can be either "operational" (the device is fully functional but possibly slow) or "non-operational" (the device is asleep until woken up). Some devices can automatically enter a non-operational state when idle for a specified amount of time and then automatically wake back up when needed. The hardware configuration is a table. For each state, an entry in the table indicates the next deeper non-operational state, if any, to autonomously transition to and the idle time required before transitioning. This patch teaches the driver to program APST so that each successive non-operational state will be entered after an idle time equal to 100% of the total latency (entry plus exit) associated with that state. The maximum acceptable latency is controlled using dev_pm_qos (e.g. power/pm_qos_latency_tolerance_us in sysfs); non-operational states with total latency greater than this value will not be used. As a special case, setting the latency tolerance to 0 will disable APST entirely. On hardware without APST support, the sysfs file will not be exposed. The latency tolerance for newly-probed devices is set by the module parameter nvme_core.default_ps_max_latency_us. In theory, the device can expose "default" APST table, but this doesn't seem to function correctly on my device (Samsung 950), nor does it seem particularly useful. There is also an optional mechanism by which a configuration can be "saved" so it will be automatically loaded on reset. This can be configured from userspace, but it doesn't seem useful to support in the driver. On my laptop, enabling APST seems to save nearly 1W. The hardware tables can be decoded in userspace with nvme-cli. 'nvme id-ctrl /dev/nvmeN' will show the power state table and 'nvme get-feature -f 0x0c -H /dev/nvme0' will show the current APST configuration. This feature is quirked off on a known-buggy Samsung device. Signed-off-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
| * nvme: Add a quirk mechanism that uses identify_ctrlAndy Lutomirski2017-02-222-0/+65
| | | | | | | | | | | | | | | | | | | | | | Currently, all NVMe quirks are based on PCI IDs. Add a mechanism to define quirks based on identify_ctrl's vendor id, model number, and/or firmware revision. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
| * nvme: make nvmf_register_transport require a create_ctrl callbackJohannes Thumshirn2017-02-225-8/+10
| | | | | | | | | | | | | | | | | | | | | | nvmf_create_ctrl() relys on the presence of a create_crtl callback in the registered nvmf_transport_ops, so make nvmf_register_transport require one. Update the available call-sites as well to reflect these changes. Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
| * nvme: Use CNS as 8-bit field and avoid endianness conversionParav Pandit2017-02-224-6/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | This patch defines CNS field as 8-bit field and avoids cpu_to/from_le conversions. Also initialize nvme_command cns value explicitly to NVME_ID_CNS_NS for readability (don't rely on the fact that NVME_ID_CNS_NS = 0). Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
| * nvme: add semicolon in nvme_command settingMax Gurtovoy2017-02-221-2/+2
| | | | | | | | | | | | | | | | Reviewed-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
| * nvmet: avoid dereferencing nvmet_reqMax Gurtovoy2017-02-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | No need to dereference req twice to get the cmd when we already have it stored in a local variable. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
| * nvme: Make controller state visible via sysfsSagi Grimberg2017-02-221-0/+24
| | | | | | | | | | | | | | | | | | | | Easier for debugging and testing state machine transitions. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
| * nvmet: Make cntlid globally uniqueSagi Grimberg2017-02-223-9/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | We usually log the cntlid which is confusing in case we have multiple subsystems each with it's own cntlid ida. Instead make cntlid ida globally unique and log the initial association. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
| * nvmet_fc: cleanup of abort flag processing in fcp_op_doneJames Smart2017-02-221-5/+3
| | | | | | | | | | | | | | | | | | Cleanup of abort flag processing in fcp_op_done. References were unnecessary Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
| * nvme: admin-cmd: fix spelling mistake: "Counld" -> "Could"Colin Ian King2017-02-221-1/+1
| | | | | | | | | | | | | | | | trivial fix to spelling mistake in pr_err message Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
| * block: get rid of blk-mq default scheduler choice Kconfig entriesJens Axboe2017-02-223-59/+13
| | | | | | | | | | | | | | | | | | | | The wording in the entries were poor and not understandable by even deities. Kill the selection for default block scheduler, and impose a policy with sane defaults. Architected-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * block/sed: Embed function data into the function sequenceJon Derrick2017-02-221-255/+163
| | | | | | | | | | | | | | | | | | | | By embedding the function data with the function sequence, we can eliminate the external function data and state variable code. It also made obvious some other small cleanups. Signed-off-by: Jon Derrick <jonathan.derrick@intel.com> Reviewed-by: Scott Bauer <scott.bauer@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * cciss: Remove kmalloc castTobin C. Harding2017-02-221-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | Coccinelle emits a warning about casting the return value of kmalloc(). Coccinelle suggests removing the cast as do kerneljanitors. Remove cast from kmalloc() call. Signed-off-by: Tobin C. Harding <me@tobin.cc> Acked-by: Don Brace <don.brace@microsemi.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * cciss: Fix checkpatch OPEN_BRACETobin C. Harding2017-02-221-10/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | Checkpatch emits ERROR:OPEN_BRACE: that open brace { should be on the previous line. Move open brace to new line. Also add space after if/switch statement since we introduce more checkpatch errors if not fixed at the same time. Signed-off-by: Tobin C. Harding <me@tobin.cc> Acked-by: Don Brace <don.brace@microsemi.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * cciss: Fix checkpatch TRAILING_WHITESPACETobin C. Harding2017-02-221-85/+85
| | | | | | | | | | | | | | | | | | | | Checkpatch emits 85 trailing whitespace warnings. Remove trailing whitespace. Signed-off-by: Tobin C. Harding <me@tobin.cc> Acked-by: Don Brace <don.brace@microsemi.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * block/sed: Check received header lengthsJon Derrick2017-02-221-14/+21
| | | | | | | | | | | | | | | | | | | | Add a buffer size check against discovery and response header lengths before we loop over their buffers. Signed-off-by: Jon Derrick <jonathan.derrick@intel.com> Reviewed-by: Scott Bauer <scott.bauer@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
| * block/sed: Add helper to qualify response tokensJon Derrick2017-02-221-36/+25
| | | | | | | | | | | | | | | | | | | | Add helper which verifies the response token is valid and matches the expected value. Merges token_type and response_get_token. Signed-off-by: Jon Derrick <jonathan.derrick@intel.com> Reviewed-by: Scott Bauer <scott.bauer@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
| * block/sed: Use ssize_t on atom parsers to return errorsJon Derrick2017-02-221-14/+14
| | | | | | | | | | | | | | | | | | | | | | The short atom parser can return an errno from decoding but does not currently return the error as a signed value. Convert all of the parsers to ssize_t. Signed-off-by: Jon Derrick <jonathan.derrick@intel.com> Reviewed-by: Scott Bauer <scott.bauer@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
| * scsi_transport_sas: fix BSG ioctl memory corruptionOmar Sandoval2017-02-211-8/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The end_device and sas_host devices support BSG ioctls, but the request_queue allocated for them isn't set up to allocate the struct scsi_request payload. This leads to memory corruption in the call to scsi_req_init() in bsg_map_hdr(), since it will memset past the end of the allocated request. Fix it by setting ->cmd_size on the allocated request_queue. Fixes: 82ed4db499b8 ("block: split scsi_request out of struct request") Signed-off-by: Omar Sandoval <osandov@fb.com> Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
| * block: Revalidate i_bdev reference in bd_aquire()Jan Kara2017-02-211-1/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a device gets removed, block device inode unhashed so that it is not used anymore (bdget() will not find it anymore). Later when a new device gets created with the same device number, we create new block device inode. However there may be file system device inodes whose i_bdev still points to the original block device inode and thus we get two active block device inodes for the same device. They will share the same gendisk so the only visible differences will be that page caches will not be coherent and BDIs will be different (the old block device inode still points to unregistered BDI). Fix the problem by checking in bd_acquire() whether i_bdev still points to active block device inode and re-lookup the block device if not. That way any open of a block device happening after the old device has been removed will get correct block device inode. Tested-by: Lekshmi Pillai <lekshmicpillai@in.ibm.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
| * block: Unhash also block device inode for the whole deviceJan Kara2017-02-211-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | Iteration over partitions in del_gendisk() omits part0. Add bdev_unhash_inode() call for the whole device. Otherwise if the device number gets reused, bdev inode will be still associated with the old (stale) bdi. Tested-by: Lekshmi Pillai <lekshmicpillai@in.ibm.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
| * block: Move bdev_unhash_inode() after invalidate_partition()Jan Kara2017-02-211-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Move bdev_unhash_inode() after invalidate_partition() as invalidate_partition() looks up bdev and it cannot find the right bdev inode after bdev_unhash_inode() is called. Thus invalidate_partition() would not invalidate page cache of the previously used bdev. Also use part_devt() when calling bdev_unhash_inode() instead of manually creating the device number. Tested-by: Lekshmi Pillai <lekshmicpillai@in.ibm.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
| * nbd: cleanup workqueue on error properlyJosef Bacik2017-02-211-1/+3
| | | | | | | | | | | | | | | | If we fail to register the blockdev we need to make sure to destroy the recv workqueue. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * nbd: set the logical and physical blocksize properlyJosef Bacik2017-02-211-23/+15
| | | | | | | | | | | | | | | | | | | | | | We noticed when trying to do O_DIRECT to an export on the server side that we were getting requests smaller than the 4k sectorsize of the device. This is because the client isn't setting the logical and physical blocksizes properly for the underlying device. Fix this up by setting the queue blocksizes and then calling bd_set_size. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * nbd: cleanup ioctl handlingJosef Bacik2017-02-211-137/+132
| | | | | | | | | | | | | | | | Break the ioctl handling out into helper functions, some of these things are getting pretty big and unwieldy. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * scsi: zero per-cmd driver data before each I/OChristoph Hellwig2017-02-211-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | Without this drivers that don't clear the state themselves can see off effects. For example Hyper-V VMs using the storvsc driver will often hang during boot due to uncleared Test Unit Ready failures. Fixes: e9c787e6 ("scsi: allocate scsi_cmnd structures as part of struct request") Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Dexuan Cui <decui@microsoft.com> Tested-by: Dexuan Cui <decui@microsoft.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* | Merge branch 'for-linus' of ↵Linus Torvalds2017-02-2440-226/+431
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull namespace updates from Eric Biederman: "There is a lot here. A lot of these changes result in subtle user visible differences in kernel behavior. I don't expect anything will care but I will revert/fix things immediately if any regressions show up. From Seth Forshee there is a continuation of the work to make the vfs ready for unpriviled mounts. We had thought the previous changes prevented the creation of files outside of s_user_ns of a filesystem, but it turns we missed the O_CREAT path. Ooops. Pavel Tikhomirov and Oleg Nesterov worked together to fix a long standing bug in the implemenation of PR_SET_CHILD_SUBREAPER where only children that are forked after the prctl are considered and not children forked before the prctl. The only known user of this prctl systemd forks all children after the prctl. So no userspace regressions will occur. Holding earlier forked children to the same rules as later forked children creates a semantic that is sane enough to allow checkpoing of processes that use this feature. There is a long delayed change by Nikolay Borisov to limit inotify instances inside a user namespace. Michael Kerrisk extends the API for files used to maniuplate namespaces with two new trivial ioctls to allow discovery of the hierachy and properties of namespaces. Konstantin Khlebnikov with the help of Al Viro adds code that when a network namespace exits purges it's sysctl entries from the dcache. As in some circumstances this could use a lot of memory. Vivek Goyal fixed a bug with stacked filesystems where the permissions on the wrong inode were being checked. I continue previous work on ptracing across exec. Allowing a file to be setuid across exec while being ptraced if the tracer has enough credentials in the user namespace, and if the process has CAP_SETUID in it's own namespace. Proc files for setuid or otherwise undumpable executables are now owned by the root in the user namespace of their mm. Allowing debugging of setuid applications in containers to work better. A bug I introduced with permission checking and automount is now fixed. The big change is to mark the mounts that the kernel initiates as a result of an automount. This allows the permission checks in sget to be safely suppressed for this kind of mount. As the permission check happened when the original filesystem was mounted. Finally a special case in the mount namespace is removed preventing unbounded chains in the mount hash table, and making the semantics simpler which benefits CRIU. The vfs fix along with related work in ima and evm I believe makes us ready to finish developing and merge fully unprivileged mounts of the fuse filesystem. The cleanups of the mount namespace makes discussing how to fix the worst case complexity of umount. The stacked filesystem fixes pave the way for adding multiple mappings for the filesystem uids so that efficient and safer containers can be implemented" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: proc/sysctl: Don't grab i_lock under sysctl_lock. vfs: Use upper filesystem inode in bprm_fill_uid() proc/sysctl: prune stale dentries during unregistering mnt: Tuck mounts under others instead of creating shadow/side mounts. prctl: propagate has_child_subreaper flag to every descendant introduce the walk_process_tree() helper nsfs: Add an ioctl() to return owner UID of a userns fs: Better permission checking for submounts exit: fix the setns() && PR_SET_CHILD_SUBREAPER interaction vfs: open() with O_CREAT should not create inodes with unknown ids nsfs: Add an ioctl() to return the namespace type proc: Better ownership of files for non-dumpable tasks in user namespaces exec: Remove LSM_UNSAFE_PTRACE_CAP exec: Test the ptracer's saved cred to see if the tracee can gain caps exec: Don't reset euid and egid when the tracee has CAP_SETUID inotify: Convert to using per-namespace limits
| * | proc/sysctl: Don't grab i_lock under sysctl_lock.Eric W. Biederman2017-02-211-13/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes: > This patch has locking problem. I've got lockdep splat under LTP. > > [ 6633.115456] ====================================================== > [ 6633.115502] [ INFO: possible circular locking dependency detected ] > [ 6633.115553] 4.9.10-debug+ #9 Tainted: G L > [ 6633.115584] ------------------------------------------------------- > [ 6633.115627] ksm02/284980 is trying to acquire lock: > [ 6633.115659] (&sb->s_type->i_lock_key#4){+.+...}, at: [<ffffffff816bc1ce>] igrab+0x1e/0x80 > [ 6633.115834] but task is already holding lock: > [ 6633.115882] (sysctl_lock){+.+...}, at: [<ffffffff817e379b>] unregister_sysctl_table+0x6b/0x110 > [ 6633.116026] which lock already depends on the new lock. > [ 6633.116026] > [ 6633.116080] > [ 6633.116080] the existing dependency chain (in reverse order) is: > [ 6633.116117] > -> #2 (sysctl_lock){+.+...}: > -> #1 (&(&dentry->d_lockref.lock)->rlock){+.+...}: > -> #0 (&sb->s_type->i_lock_key#4){+.+...}: > > d_lock nests inside i_lock > sysctl_lock nests inside d_lock in d_compare > > This patch adds i_lock nesting inside sysctl_lock. Al Viro <viro@ZenIV.linux.org.uk> replied: > Once ->unregistering is set, you can drop sysctl_lock just fine. So I'd > try something like this - use rcu_read_lock() in proc_sys_prune_dcache(), > drop sysctl_lock() before it and regain after. Make sure that no inodes > are added to the list ones ->unregistering has been set and use RCU list > primitives for modifying the inode list, with sysctl_lock still used to > serialize its modifications. > > Freeing struct inode is RCU-delayed (see proc_destroy_inode()), so doing > igrab() is safe there. Since we don't drop inode reference until after we'd > passed beyond it in the list, list_for_each_entry_rcu() should be fine. I agree with Al Viro's analsysis of the situtation. Fixes: d6cffbbe9a7e ("proc/sysctl: prune stale dentries during unregistering") Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Tested-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Suggested-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | vfs: Use upper filesystem inode in bprm_fill_uid()Vivek Goyal2017-02-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Right now bprm_fill_uid() uses inode fetched from file_inode(bprm->file). This in turn returns inode of lower filesystem (in a stacked filesystem setup). I was playing with modified patches of shiftfs posted by james bottomley and realized that through shiftfs setuid bit does not take effect. And reason being that we fetch uid/gid from inode of lower fs (and not from shiftfs inode). And that results in following checks failing. /* We ignore suid/sgid if there are no mappings for them in the ns */ if (!kuid_has_mapping(bprm->cred->user_ns, uid) || !kgid_has_mapping(bprm->cred->user_ns, gid)) return; uid/gid fetched from lower fs inode might not be mapped inside the user namespace of container. So we need to look at uid/gid fetched from upper filesystem (shiftfs in this particular case) and these should be mapped and setuid bit can take affect. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
| * | proc/sysctl: prune stale dentries during unregisteringKonstantin Khlebnikov2017-02-134-19/+51
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently unregistering sysctl table does not prune its dentries. Stale dentries could slowdown sysctl operations significantly. For example, command: # for i in {1..100000} ; do unshare -n -- sysctl -a &> /dev/null ; done creates a millions of stale denties around sysctls of loopback interface: # sysctl fs.dentry-state fs.dentry-state = 25812579 24724135 45 0 0 0 All of them have matching names thus lookup have to scan though whole hash chain and call d_compare (proc_sys_compare) which checks them under system-wide spinlock (sysctl_lock). # time sysctl -a > /dev/null real 1m12.806s user 0m0.016s sys 1m12.400s Currently only memory reclaimer could remove this garbage. But without significant memory pressure this never happens. This patch collects sysctl inodes into list on sysctl table header and prunes all their dentries once that table unregisters. Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes: > On 10.02.2017 10:47, Al Viro wrote: >> how about >> the matching stats *after* that patch? > > dcache size doesn't grow endlessly, so stats are fine > > # sysctl fs.dentry-state > fs.dentry-state = 92712 58376 45 0 0 0 > > # time sysctl -a &>/dev/null > > real 0m0.013s > user 0m0.004s > sys 0m0.008s Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
| * | mnt: Tuck mounts under others instead of creating shadow/side mounts.Eric W. Biederman2017-02-034-63/+111
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ever since mount propagation was introduced in cases where a mount in propagated to parent mount mountpoint pair that is already in use the code has placed the new mount behind the old mount in the mount hash table. This implementation detail is problematic as it allows creating arbitrary length mount hash chains. Furthermore it invalidates the constraint maintained elsewhere in the mount code that a parent mount and a mountpoint pair will have exactly one mount upon them. Making it hard to deal with and to talk about this special case in the mount code. Modify mount propagation to notice when there is already a mount at the parent mount and mountpoint where a new mount is propagating to and place that preexisting mount on top of the new mount. Modify unmount propagation to notice when a mount that is being unmounted has another mount on top of it (and no other children), and to replace the unmounted mount with the mount on top of it. Move the MNT_UMUONT test from __lookup_mnt_last into __propagate_umount as that is the only call of __lookup_mnt_last where MNT_UMOUNT may be set on any mount visible in the mount hash table. These modifications allow: - __lookup_mnt_last to be removed. - attach_shadows to be renamed __attach_mnt and its shadow handling to be removed. - commit_tree to be simplified - copy_tree to be simplified The result is an easier to understand tree of mounts that does not allow creation of arbitrary length hash chains in the mount hash table. The result is also a very slight userspace visible difference in semantics. The following two cases now behave identically, where before order mattered: case 1: (explicit user action) B is a slave of A mount something on A/a , it will propagate to B/a and than mount something on B/a case 2: (tucked mount) B is a slave of A mount something on B/a and than mount something on A/a Histroically umount A/a would fail in case 1 and succeed in case 2. Now umount A/a succeeds in both configurations. This very small change in semantics appears if anything to be a bug fix to me and my survey of userspace leads me to believe that no programs will notice or care of this subtle semantic change. v2: Updated to mnt_change_mountpoint to not call dput or mntput and instead to decrement the counts directly. It is guaranteed that there will be other references when mnt_change_mountpoint is called so this is safe. v3: Moved put_mountpoint under mount_lock in attach_recursive_mnt As the locking in fs/namespace.c changed between v2 and v3. v4: Reworked the logic in propagate_mount_busy and __propagate_umount that detects when a mount completely covers another mount. v5: Removed unnecessary tests whose result is alwasy true in find_topper and attach_recursive_mnt. v6: Document the user space visible semantic difference. Cc: stable@vger.kernel.org Fixes: b90fa9ae8f51 ("[PATCH] shared mount handling: bind and rbind") Tested-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | prctl: propagate has_child_subreaper flag to every descendantPavel Tikhomirov2017-02-032-3/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If process forks some children when it has is_child_subreaper flag enabled they will inherit has_child_subreaper flag - first group, when is_child_subreaper is disabled forked children will not inherit it - second group. So child-subreaper does not reparent all his descendants when their parents die. Having these two differently behaving groups can lead to confusion. Also it is a problem for CRIU, as when we restore process tree we need to somehow determine which descendants belong to which group and much harder - to put them exactly to these group. To simplify these we can add a propagation of has_child_subreaper flag on PR_SET_CHILD_SUBREAPER, walking all descendants of child- subreaper to setup has_child_subreaper flag. In common cases when process like systemd first sets itself to be a child-subreaper and only after that forks its services, we will have zero-length list of descendants to walk. Testing with binary subtree of 2^15 processes prctl took < 0.007 sec and has shown close to linear dependency(~0.2 * n * usec) on lower numbers of processes. Moreover, I doubt someone intentionaly pre-forks the children whitch should reparent to init before becoming subreaper, because some our ancestor migh have had is_child_subreaper flag while forking our sub-tree and our childs will all inherit has_child_subreaper flag, and we have no way to influence it. And only way to check if we have no has_child_subreaper flag is to create some childs, kill them and see where they will reparent to. Using walk_process_tree helper to walk subtree, thanks to Oleg! Timing seems to be the same. Optimize: a) When descendant already has has_child_subreaper flag all his subtree has it too already. * for a) to be true need to move has_child_subreaper inheritance under the same tasklist_lock with adding task to its ->real_parent->children as without it process can inherit zero has_child_subreaper, then we set 1 to it's parent flag, check that parent has no more children, and only after child with wrong flag is added to the tree. * Also make these inheritance more clear by using real_parent instead of current, as on clone(CLONE_PARENT) if current has is_child_subreaper and real_parent has no is_child_subreaper or has_child_subreaper, child will have has_child_subreaper flag set without actually having a subreaper in it's ancestors. b) When some descendant is child_reaper, it's subtree is in different pidns from us(original child-subreaper) and processes from other pidns will never reparent to us. So we can skip their(a,b) subtree from walk. v2: switch to walk_process_tree() general helper, move has_child_subreaper inheritance v3: remove csr_descendant leftover, change current to real_parent in has_child_subreaper inheritance v4: small commit message fix Fixes: ebec18a6d3aa ("prctl: add PR_{SET,GET}_CHILD_SUBREAPER to allow simple process supervision") Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
| * | introduce the walk_process_tree() helperOleg Nesterov2017-02-032-0/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add the new helper to walk the process tree, the next patch adds a user. Note that it visits the group leaders only, proc_visitor can do for_each_thread itself or we can trivially extend walk_process_tree() to do this. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
| * | Merge branch 'nsfs-discovery'Eric W. Biederman2017-02-032-2/+20
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Michael Kerrisk <<mtk.manpages@gmail.com> writes: I would like to write code that discovers the namespace setup on a live system. The NS_GET_PARENT and NS_GET_USERNS ioctl() operations added in Linux 4.9 provide much of what I want, but there are still a couple of small pieces missing. Those pieces are added with this patch series. Here's an example program that makes use of the new ioctl() operations. 8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x--- /* ns_capable.c (C) 2016 Michael Kerrisk, <mtk.manpages@gmail.com> Licensed under the GNU General Public License v2 or later. Test whether a process (identified by PID) might (subject to LSM checks) have capabilities in a namespace (identified by a /proc/PID/ns/xxx file). */ } while (0) exit(EXIT_FAILURE); } while (0) /* Display capabilities sets of process with specified PID */ static void show_cap(pid_t pid) { cap_t caps; char *cap_string; caps = cap_get_pid(pid); if (caps == NULL) errExit("cap_get_proc"); cap_string = cap_to_text(caps, NULL); if (cap_string == NULL) errExit("cap_to_text"); printf("Capabilities: %s\n", cap_string); } /* Obtain the effective UID pf the process 'pid' by scanning its /proc/PID/file */ static uid_t get_euid_of_process(pid_t pid) { char path[PATH_MAX]; char line[1024]; int uid; snprintf(path, sizeof(path), "/proc/%ld/status", (long) pid); FILE *fp; fp = fopen(path, "r"); if (fp == NULL) errExit("fopen-/proc/PID/status"); for (;;) { if (fgets(line, sizeof(line), fp) == NULL) { /* Should never happen... */ fprintf(stderr, "Failure scanning %s\n", path); exit(EXIT_FAILURE); } if (strstr(line, "Uid:") == line) { sscanf(line, "Uid: %*d %d %*d %*d", &uid); return uid; } } } int main(int argc, char *argv[]) { int ns_fd, userns_fd, pid_userns_fd; int nstype; int next_fd; struct stat pid_stat; struct stat target_stat; char *pid_str; pid_t pid; char path[PATH_MAX]; if (argc < 2) { fprintf(stderr, "Usage: %s PID [ns-file]\n", argv[0]); fprintf(stderr, "\t'ns-file' is a /proc/PID/ns/xxxx file; " "if omitted, use the namespace\n" "\treferred to by standard input " "(file descriptor 0)\n"); exit(EXIT_FAILURE); } pid_str = argv[1]; pid = atoi(pid_str); if (argc <= 2) { ns_fd = STDIN_FILENO; } else { ns_fd = open(argv[2], O_RDONLY); if (ns_fd == -1) errExit("open-ns-file"); } /* Get the relevant user namespace FD, which is 'ns_fd' if 'ns_fd' refers to a user namespace, otherwise the user namespace that owns 'ns_fd' */ nstype = ioctl(ns_fd, NS_GET_NSTYPE); if (nstype == -1) errExit("ioctl-NS_GET_NSTYPE"); if (nstype == CLONE_NEWUSER) { userns_fd = ns_fd; } else { userns_fd = ioctl(ns_fd, NS_GET_USERNS); if (userns_fd == -1) errExit("ioctl-NS_GET_USERNS"); } /* Obtain 'stat' info for the user namespace of the specified PID */ snprintf(path, sizeof(path), "/proc/%s/ns/user", pid_str); pid_userns_fd = open(path, O_RDONLY); if (pid_userns_fd == -1) errExit("open-PID"); if (fstat(pid_userns_fd, &pid_stat) == -1) errExit("fstat-PID"); /* Get 'stat' info for the target user namesapce */ if (fstat(userns_fd, &target_stat) == -1) errExit("fstat-PID"); /* If the PID is in the target user namespace, then it has whatever capabilities are in its sets. */ if (pid_stat.st_dev == target_stat.st_dev && pid_stat.st_ino == target_stat.st_ino) { printf("PID is in target namespace\n"); printf("Subject to LSM checks, it has the following capabilities\n"); show_cap(pid); exit(EXIT_SUCCESS); } /* Otherwise, we need to walk through the ancestors of the target user namespace to see if PID is in an ancestor namespace */ for (;;) { int f; next_fd = ioctl(userns_fd, NS_GET_PARENT); if (next_fd == -1) { /* The error here should be EPERM... */ if (errno != EPERM) errExit("ioctl-NS_GET_PARENT"); printf("PID is not in an ancestor namespace\n"); printf("It has no capabilities in the target namespace\n"); exit(EXIT_SUCCESS); } if (fstat(next_fd, &target_stat) == -1) errExit("fstat-PID"); /* If the 'stat' info for this user namespace matches the 'stat' * info for 'next_fd', then the PID is in an ancestor namespace */ if (pid_stat.st_dev == target_stat.st_dev && pid_stat.st_ino == target_stat.st_ino) break; /* Next time round, get the next parent */ f = userns_fd; userns_fd = next_fd; close(f); } /* At this point, we found that PID is in an ancestor of the target user namespace, and 'userns_fd' refers to the immediate descendant user namespace of PID in the chain of user namespaces from PID to the target user namespace. If the effective UID of PID matches the owner UID of descendant user namespace, then PID has all capabilities in the descendant namespace(s); otherwise, it just has the capabilities that are in its sets. */ uid_t owner_uid, uid; if (ioctl(userns_fd, NS_GET_OWNER_UID, &owner_uid) == -1) { perror("ioctl-NS_GET_OWNER_UID"); exit(EXIT_FAILURE); } uid = get_euid_of_process(pid); printf("PID is in an ancestor namespace\n"); if (owner_uid == uid) { printf("And its effective UID matches the owner " "of the namespace\n"); printf("Subject to LSM checks, PID has all capabilities in " "that namespace!\n"); } else { printf("But its effective UID does not match the owner " "of the namespace\n"); printf("Subject to LSM checks, it has the following capabilities\n"); show_cap(pid); } exit(EXIT_SUCCESS); } 8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x--- Michael Kerrisk (2): nsfs: Add an ioctl() to return the namespace type nsfs: Add an ioctl() to return owner UID of a userns fs/nsfs.c | 13 +++++++++++++ include/uapi/linux/nsfs.h | 9 +++++++-- 2 files changed, 20 insertions(+), 2 deletions(-)