summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* block: Fix two comments that refer to .queue_rq() return valuesBart Van Assche2017-08-181-2/+2
| | | | | | | | | | | | | | | Since patch "blk-mq: switch .queue_rq return value to blk_status_t" .queue_rq() returns a BLK_STS_* value instead of a BLK_MQ_RQ_* value. Hence refer to the former in comments about .queue_rq() return values. Fixes: commit 39a70c76b89b ("blk-mq: clarify dispatch may not be drained/blocked by stopping queue") Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Cc: Ming Lei <ming.lei@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nbd: change the default nbd partitionsJosef Bacik2017-08-171-2/+2
| | | | | | | | | There's no reason to have partitions disabled for nbd by default, it costs us nothing to have it enabled and is just confusing/obnoxious to users who try to use partitions with nbd. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nbd: allow device creation at a specific indexJosef Bacik2017-08-171-0/+9
| | | | | | | | | If users really want to use a particular index for their nbd device and it doesn't already exist there's no reason we can't just create it for them. Do this instead of erroring out. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* loop: fix to a race condition due to the early registration of deviceAnton Volkov2017-08-151-6/+8
| | | | | | | | | | | | | | | The early device registration made possible a race leading to allocations of disks with wrong minors. This patch moves the device registration further down the loop_init function to make the race infeasible. Found by Linux Driver Verification project (linuxtesting.org). Signed-off-by: Anton Volkov <avolkov@ispras.ru> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* cfq: Give a chance for arming slice idle timer in case of group_idleRitesh Harjani2017-08-111-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In below scenario blkio cgroup does not work as per their assigned weights :- 1. When the underlying device is nonrotational with a single HW queue with depth of >= CFQ_HW_QUEUE_MIN 2. When the use case is forming two blkio cgroups cg1(weight 1000) & cg2(wight 100) and two processes(file1 and file2) doing sync IO in their respective blkio cgroups. For above usecase result of fio (without this patch):- file1: (groupid=0, jobs=1): err= 0: pid=685: Thu Jan 1 19:41:49 1970 write: IOPS=1315, BW=41.1MiB/s (43.1MB/s)(1024MiB/24906msec) <...> file2: (groupid=0, jobs=1): err= 0: pid=686: Thu Jan 1 19:41:49 1970 write: IOPS=1295, BW=40.5MiB/s (42.5MB/s)(1024MiB/25293msec) <...> // both the process BW is equal even though they belong to diff. cgroups with weight of 1000(cg1) and 100(cg2) In above case (for non rotational NCQ devices), as soon as the request from cg1 is completed and even though it is provided with higher set_slice=10, because of CFQ algorithm when the driver tries to fetch the request, CFQ expires this group without providing any idle time nor weight priority and schedules another cfq group (in this case cg2). And thus both cfq groups(cg1 & cg2) keep alternating to get the disk time and hence loses the cgroup weight based scheduling. Below patch gives a chance to cfq algorithm (cfq_arm_slice_timer) to arm the slice timer in case group_idle is enabled. In case if group_idle is also not required (including for nonrotational NCQ drives), we need to explicitly set group_idle = 0 from sysfs for such cases. With this patch result of fio(for above usecase) :- file1: (groupid=0, jobs=1): err= 0: pid=690: Thu Jan 1 00:06:08 1970 write: IOPS=1706, BW=53.3MiB/s (55.9MB/s)(1024MiB/19197msec) <..> file2: (groupid=0, jobs=1): err= 0: pid=691: Thu Jan 1 00:06:08 1970 write: IOPS=1043, BW=32.6MiB/s (34.2MB/s)(1024MiB/31401msec) <..> // In this processes BW is as per their respective cgroups weight. Signed-off-by: Ritesh Harjani <riteshh@codeaurora.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block, bfq: boost throughput with flash-based non-queueing devicesPaolo Valente2017-08-111-10/+19
| | | | | | | | | | | | | | | | | | | When a queue associated with a process remains empty, there are cases where throughput gets boosted if the device is idled to await the arrival of a new I/O request for that queue. Currently, BFQ assumes that one of these cases is when the device has no internal queueing (regardless of the properties of the I/O being served). Unfortunately, this condition has proved to be too general. So, this commit refines it as "the device has no internal queueing and is rotational". This refinement provides a significant throughput boost with random I/O, on flash-based storage without internal queueing. For example, on a HiKey board, throughput increases by up to 125%, growing, e.g., from 6.9MB/s to 15.6MB/s with two or three random readers in parallel. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Luca Miccio <lucmiccio@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block,bfq: refactor device-idling logicPaolo Valente2017-08-112-62/+67
| | | | | | | | | | | | | | | The logic that decides whether to idle the device is scattered across three functions. Almost all of the logic is in the function bfq_bfqq_may_idle, but (1) part of the decision is made in bfq_update_idle_window, and (2) the function bfq_bfqq_must_idle may switch off idling regardless of the output of bfq_bfqq_may_idle. In addition, both bfq_update_idle_window and bfq_bfqq_must_idle make their decisions as a function of parameters that are used, for similar purposes, also in bfq_bfqq_may_idle. This commit addresses these issues by moving all the logic into bfq_bfqq_may_idle. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: remove unused syncfull/asyncfull queue flagsJens Axboe2017-08-102-33/+29
| | | | | | | | We haven't used these in years, but somehow the definitions still remained. Kill them, and renumber the QUEUE_FLAG_ space. We had a hole in the beginning of the space, too. Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: enable checking two part inflight counts at the same timeJens Axboe2017-08-092-17/+33
| | | | | | | | | Modify blk_mq_in_flight() to count both a partition and root at the same time. Then we only have to call it once, instead of potentially looping the tags twice. Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: provide internal in-flight variantJens Axboe2017-08-094-29/+77
| | | | | | | | | | We don't have to inc/dec some counter, since we can just iterate the tags. That makes inc/dec a noop, but means we have to iterate busy tags to get an in-flight count. Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: make part_in_flight() take an array of two intsJens Axboe2017-08-094-9/+20
| | | | | | | | | | | | | | | Instead of returning the count that matches the partition, pass in an array of two ints. Index 0 will be filled with the inflight count for the partition in question, and index 1 will filled with the root inflight count, if the partition passed in is not the root. This is in preparation for being able to calculate both in one go. Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: pass in queue to inflight accountingJens Axboe2017-08-0913-49/+64
| | | | | | | | | No functional change in this patch, just in preparation for basing the inflight mechanism on the queue in question. Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq-tag: check for NULL rq when iterating tagsJens Axboe2017-08-091-2/+12
| | | | | | | | | | | | | | Since we introduced blk-mq-sched, the tags->rqs[] array has been dynamically assigned. So we need to check for NULL when iterating, since there's a window of time where the bit is set, but we haven't dynamically assigned the tags->rqs[] array position yet. This is perfectly safe, since the memory backing of the request is never going away while the device is alive. Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* dm-crypt: don't mess with BIP_BLOCK_INTEGRITYChristoph Hellwig2017-08-091-3/+0
| | | | | | | | This flag is never set right after calling bio_integrity_alloc, so don't clear it and confuse the reader. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* bio-integrity: move the bio integrity profile check earlier in ↵Christoph Hellwig2017-08-091-3/+3
| | | | | | | | | | bio_integrity_prep This makes the code more obvious, and moves the most likely branch first in the function. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* null_blk: make sure submit_queues > 0weiping zhang2017-08-071-2/+2
| | | | | | | set submit_queues to 1 by default, and make sure it's value > 0. Signed-off-by: weiping zhang <zhangweiping@didichuxing.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* null_blk: simplify logic for use_per_node_hctxweiping zhang2017-08-071-5/+2
| | | | | | | make sure submit_queues equal nr_online_nodes. Signed-off-by: weiping zhang <zhangweiping@didichuxing.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: Add comment to submit_bio_wait()Jan Kara2017-08-021-0/+4
| | | | | | | | submit_bio_wait() does not consume bio reference. Add comment about that. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: add warning to __blk_mq_run_hw_queue() for ints disabledJens Axboe2017-08-011-0/+10
| | | | | | | | | | | | We recently had a bug in the IPR SCSI driver, where it would end up making the SCSI mid layer run the mq hardware queue with interrupts disabled. This isn't legal, since the software queue locking relies on never being grabbed from interrupt context. Additionally, drivers that set BLK_MQ_F_BLOCKING may schedule from this context. Add a WARN_ON_ONCE() to catch bad users up front. Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: blk_mq_requeue_work() doesn't need to save IRQ flagsJens Axboe2017-07-291-3/+2
| | | | | | | We know we're in process context, so don't bother using the IRQ safe versions of the spin lock. Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: DAC960: shut up format-overflow warningArnd Bergmann2017-07-291-4/+8
| | | | | | | | | | | | | | | | | | | gcc-7 points out that a large controller number would overflow the string length for the procfs name and the firmware version string: drivers/block/DAC960.c: In function 'DAC960_Probe': drivers/block/DAC960.c:6591:38: warning: 'sprintf' may write a terminating nul past the end of the destination [-Wformat-overflow=] drivers/block/DAC960.c: In function 'DAC960_V1_ReadControllerConfiguration': drivers/block/DAC960.c:1681:40: error: '%02d' directive writing between 2 and 3 bytes into a region of size between 2 and 5 [-Werror=format-overflow=] drivers/block/DAC960.c:1681:40: note: directive argument in the range [0, 255] drivers/block/DAC960.c:1681:3: note: 'sprintf' output between 10 and 14 bytes into a destination of size 12 Both of these seem appropriately sized, and using snprintf() instead of sprintf() improves this by ensuring that even incorrect data won't cause undefined behavior here. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: use standard blktrace API to output cgroup info for debug notesShaohua Li2017-07-295-24/+35
| | | | | | | | | | | | | | | | | Currently cfq/bfq/blk-throttle output cgroup info in trace in their own way. Now we have standard blktrace API for this, so convert them to use it. Note, this changes the behavior a little bit. cgroup info isn't output by default, we only do this with 'blk_cgroup' option enabled. cgroup info isn't output as a string by default too, we only do this with 'blk_cgname' option enabled. Also cgroup info is output in different position of the note string. I think these behavior changes aren't a big issue (actually we make trace data shorter which is good), since the blktrace note is solely for debugging. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blktrace: add an option to allow displaying cgroup pathShaohua Li2017-07-295-1/+52
| | | | | | | | | | | | By default we output cgroup id in blktrace. This adds an option to display cgroup path. Since get cgroup path is a relativly heavy operation, we don't enable it by default. with the option enabled, blktrace will output something like this: dd-1353 [007] d..2 293.015252: 8,0 /test/level D R 24 + 8 [dd] Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: always attach cgroup info into bioShaohua Li2017-07-292-6/+4
| | | | | | | | | | | | blkcg_bio_issue_check() already gets blkcg for a BIO. bio_associate_blkcg() uses a percpu refcounter, so it's a very cheap operation. There is no point we don't attach the cgroup info into bio at blkcg_bio_issue_check. This also makes blktrace outputs correct cgroup info. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blktrace: export cgroup info in traceShaohua Li2017-07-292-73/+161
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently blktrace isn't cgroup aware. blktrace prints out task name of current context, but the task of current context isn't always in the cgroup where the BIO comes from. We can't use task name to find out IO cgroup. For example, Writeback BIOs always comes from flusher thread but the BIOs are for different blk cgroups. Request could be requeued and dispatched from completely different tasks. MD/DM are another examples. This patch tries to fix the gap. We print out cgroup fhandle info in blktrace. Userspace can use open_by_handle_at() syscall to find the cgroup by fhandle. Or userspace can use name_to_handle_at() syscall to find fhandle for a cgroup and use a BPF program to filter out blktrace for a specific cgroup. We add a new 'blk_cgroup' trace option for blk tracer. It's default off. Application which doesn't know the new option isn't affected. When it's on, we output fhandle info right after blk_io_trace with an extra bit set in event action. So from application point of view, blktrace with the option will output new actions. I didn't change blk trace event yet, since I'm not sure if changing the trace event output is an ABI issue. If not, I'll do it later. Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* cgroup: export fhandle info for a cgroupShaohua Li2017-07-291-0/+8
| | | | | | | | | | | | | | Add an API to export cgroup fhandle info. We don't export a full 'struct file_handle', there are unrequired info. Sepcifically, cgroup is always a directory, so we don't need a 'FILEID_INO32_GEN_PARENT' type fhandle, we only need export the inode number and generation number just like what generic_fh_to_dentry does. And we can avoid the overhead of getting an inode too, since kernfs_node_id (ino and generation) has all the info required. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* kernfs: add exportfs operationsShaohua Li2017-07-293-1/+70
| | | | | | | | | | | | Now we have the facilities to implement exportfs operations. The idea is cgroup can export the fhandle info to userspace, then userspace uses fhandle to find the cgroup name. Another example is userspace can get fhandle for a cgroup and BPF uses the fhandle to filter info for the cgroup. Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* kernfs: introduce kernfs_node_idShaohua Li2017-07-296-13/+21
| | | | | | | | | | | inode number and generation can identify a kernfs node. We are going to export the identification by exportfs operations, so put ino and generation into a separate structure. It's convenient when later patches use the identification. Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* kernfs: don't set dentry->d_fsdataShaohua Li2017-07-296-31/+27
| | | | | | | | | | | | | | When working on adding exportfs operations in kernfs, I found it's hard to initialize dentry->d_fsdata in the exportfs operations. Looks there is no way to do it without race condition. Look at the kernfs code closely, there is no point to set dentry->d_fsdata. inode->i_private already points to kernfs_node, and we can get inode from a dentry. So this patch just delete the d_fsdata usage. Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* kernfs: add an API to get kernfs node from inode numberShaohua Li2017-07-293-1/+69
| | | | | | | | | | | | | | | Add an API to get kernfs node from inode number. We will need this to implement exportfs operations. This API will be used in blktrace too later, so it should be as fast as possible. To make the API lock free, kernfs node is freed in RCU context. And we depend on kernfs_node count/ino number to filter out stale kernfs nodes. Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* kernfs: implement i_generationShaohua Li2017-07-293-1/+12
| | | | | | | | | | | | | | | | | | | | Set i_generation for kernfs inode. This is required to implement exportfs operations. The generation is 32-bit, so it's possible the generation wraps up and we find stale files. To reduce the posssibility, we don't reuse inode numer immediately. When the inode number allocation wraps, we increase generation number. In this way generation/inode number consist of a 64-bit number which is unlikely duplicated. This does make the idr tree more sparse and waste some memory. Since idr manages 32-bit keys, idr uses a 6-level radix tree, each level covers 6 bits of the key. In a 100k inode kernfs, the worst case will have around 300k radix tree node. Each node is 576bytes, so the tree will use about ~150M memory. Sounds not too bad, if this really is a problem, we should find better data structure. Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* kernfs: use idr instead of ida to manage inode numberShaohua Li2017-07-292-6/+13
| | | | | | | | | | | kernfs uses ida to manage inode number. The problem is we can't get kernfs_node from inode number with ida. Switching to use idr, next patch will add an API to get kernfs_node from inode number. Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Merge tag 'devicetree-fixes-for-4.13' of ↵Linus Torvalds2017-07-292-2/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux Pull DeviceTree fixes from Rob Herring: "Two small DT fixes: - Fix error handling in of_irq_to_resource_table() due to of_irq_to_resource() error return changes. - Fix dtx_diff script due to dts include path changes" * tag 'devicetree-fixes-for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux: of: irq: fix of_irq_to_resource() error check scripts/dtc: dtx_diff - update include dts paths to match build
| * of: irq: fix of_irq_to_resource() error checkSergei Shtylyov2017-07-211-1/+1
| | | | | | | | | | | | | | | | | | | | of_irq_to_resource() has recently been fixed to return negative error #'s along with 0, however of_irq_to_resource_table() still only regards 0 as invalid IRQ -- fix it up. Fixes: 7a4228bbff76 ("of: irq: use of_irq_get() in of_irq_to_resource()") Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Signed-off-by: Rob Herring <robh@kernel.org>
| * scripts/dtc: dtx_diff - update include dts paths to match buildFrank Rowand2017-07-201-1/+1
| | | | | | | | | | | | | | | | | | | | | | Update the cpp include flags for compiling device tree dts files to match the changes made to the kernel build process in commit d5d332d3f7e8 ("devicetree: Move include prefixes from arch to separate directory"). Cc: <stable@vger.kernel.org> # 4.12 Signed-off-by: Frank Rowand <frank.rowand@sony.com> Signed-off-by: Rob Herring <robh@kernel.org>
* | Merge tag 'nfs-for-4.13-3' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds2017-07-282-6/+12
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull NFS client fixes from Anna Schumaker: "More NFS client bugfixes for 4.13. Most of these fix locking bugs that Ben and Neil noticed, but I also have a patch to fix one more access bug that was reported after last week. Stable fixes: - Fix a race where CB_NOTIFY_LOCK fails to wake a waiter - Invalidate file size when taking a lock to prevent corruption Other fixes: - Don't excessively generate tiny writes with fallocate - Use the raw NFS access mask in nfs4_opendata_access()" * tag 'nfs-for-4.13-3' of git://git.linux-nfs.org/projects/anna/linux-nfs: NFSv4.1: Fix a race where CB_NOTIFY_LOCK fails to wake a waiter NFS: Optimize fallocate by refreshing mapping when needed. NFS: invalidate file size when taking a lock. NFS: Use raw NFS access mask in nfs4_opendata_access()
| * | NFSv4.1: Fix a race where CB_NOTIFY_LOCK fails to wake a waiterBenjamin Coddington2017-07-281-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | nfs4_retry_setlk() sets the task's state to TASK_INTERRUPTIBLE within the same region protected by the wait_queue's lock after checking for a notification from CB_NOTIFY_LOCK callback. However, after releasing that lock, a wakeup for that task may race in before the call to freezable_schedule_timeout_interruptible() and set TASK_WAKING, then freezable_schedule_timeout_interruptible() will set the state back to TASK_INTERRUPTIBLE before the task will sleep. The result is that the task will sleep for the entire duration of the timeout. Since we've already set TASK_INTERRUPTIBLE in the locked section, just use freezable_schedule_timout() instead. Fixes: a1d617d8f134 ("nfs: allow blocking locks to be awoken by lock callbacks") Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Cc: stable@vger.kernel.org # v4.9+ Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| * | NFS: Optimize fallocate by refreshing mapping when needed.NeilBrown2017-07-271-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | posix_fallocate() will allocate space in an NFS file by considering the last byte of every 4K block. If it is before EOF, it will read the byte and if it is zero, a zero is written out. If it is after EOF, the zero is unconditionally written. For the blocks beyond EOF, if NFS believes its cache is valid, it will expand these writes to write full pages, and then will merge the pages. This results if (typically) 1MB writes. If NFS believes its cache is not valid (particularly if NFS_INO_INVALID_DATA or NFS_INO_REVAL_PAGECACHE are set - see nfs_write_pageuptodate()), it will send the individual 1-byte writes. This results in (typically) 256 times as many RPC requests, and can be substantially slower. Currently nfs_revalidate_mapping() is only used when reading a file or mmapping a file, as these are times when the content needs to be up-to-date. Writes don't generally need the cache to be up-to-date, but writes beyond EOF can benefit, particularly in the posix_fallocate() case. So this patch calls nfs_revalidate_mapping() when writing beyond EOF - i.e. when there is a gap between the end of the file and the start of the write. If the cache is thought to be out of date (as happens after taking a file lock), this will cause a GETATTR, and the two flags mentioned above will be cleared. With this, posix_fallocate() on a newly locked file does not generate excessive tiny writes. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| * | NFS: invalidate file size when taking a lock.NeilBrown2017-07-271-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Prior to commit ca0daa277aca ("NFS: Cache aggressively when file is open for writing"), NFS would revalidate, or invalidate, the file size when taking a lock. Since that commit it only invalidates the file content. If the file size is changed on the server while wait for the lock, the client will have an incorrect understanding of the file size and could corrupt data. This particularly happens when writing beyond the (supposed) end of file and can be easily be demonstrated with posix_fallocate(). If an application opens an empty file, waits for a write lock, and then calls posix_fallocate(), glibc will determine that the underlying filesystem doesn't support fallocate (assuming version 4.1 or earlier) and will write out a '0' byte at the end of each 4K page in the region being fallocated that is after the end of the file. NFS will (usually) detect that these writes are beyond EOF and will expand them to cover the whole page, and then will merge the pages. Consequently, NFS will write out large blocks of zeroes beyond where it thought EOF was. If EOF had moved, the pre-existing part of the file will be over-written. Locking should have protected against this, but it doesn't. This patch restores the use of nfs_zap_caches() which invalidated the cached attributes. When posix_fallocate() asks for the file size, the request will go to the server and get a correct answer. cc: stable@vger.kernel.org (v4.8+) Fixes: ca0daa277aca ("NFS: Cache aggressively when file is open for writing") Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| * | NFS: Use raw NFS access mask in nfs4_opendata_access()Anna Schumaker2017-07-261-4/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit bd8b2441742b ("NFS: Store the raw NFS access mask in the inode's access cache") changed how the access results are stored after an access() call. An NFS v4 OPEN might have access bits returned with the opendata, so we should use the NFS4_ACCESS values when determining the return value in nfs4_opendata_access(). Fixes: bd8b2441742b ("NFS: Store the raw NFS access mask in the inode's access cache") Reported-by: Eryu Guan <eguan@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Tested-by: Takashi Iwai <tiwai@suse.de>
* | | Merge tag 'xfs-4.13-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds2017-07-286-3/+39
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull xfs fixes from Darrick Wong: - fix firstfsb variables that we left uninitialized, which could lead to locking problems. - check for NULL metadata buffer pointers before using them. - don't allow btree cursor manipulation if the btree block is corrupt. Better to just shut down. - fix infinite loop problems in quotacheck. - fix buffer overrun when validating directory blocks. - fix deadlock problem in bunmapi. * tag 'xfs-4.13-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: fix multi-AG deadlock in xfs_bunmapi xfs: check that dir block entries don't off the end of the buffer xfs: fix quotacheck dquot id overflow infinite loop xfs: check _alloc_read_agf buffer pointer before using xfs: set firstfsb to NULLFSBLOCK before feeding it to _bmapi_write xfs: check _btree_check_block value
| * | | xfs: fix multi-AG deadlock in xfs_bunmapiChristoph Hellwig2017-07-261-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Just like in the allocator we must avoid touching multiple AGs out of order when freeing blocks, as freeing still locks the AGF and can cause the same AB-BA deadlocks as in the allocation path. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * | | xfs: check that dir block entries don't off the end of the bufferDarrick J. Wong2017-07-251-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we're checking the entries in a directory buffer, make sure that the entry length doesn't push us off the end of the buffer. Found via xfs/388 writing ones to the length fields. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
| * | | xfs: fix quotacheck dquot id overflow infinite loopBrian Foster2017-07-241-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a dquot has an id of U32_MAX, the next lookup index increment overflows the uint32_t back to 0. This starts the lookup sequence over from the beginning, repeats indefinitely and results in a livelock. Update xfs_qm_dquot_walk() to explicitly check for the lookup overflow and exit the loop. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * | | xfs: check _alloc_read_agf buffer pointer before usingDarrick J. Wong2017-07-202-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In some circumstances, _alloc_read_agf can return an error code of zero but also a null AGF buffer pointer. Check for this and jump out. Fixes-coverity-id: 1415250 Fixes-coverity-id: 1415320 Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
| * | | xfs: set firstfsb to NULLFSBLOCK before feeding it to _bmapi_writeDarrick J. Wong2017-07-202-1/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We must initialize the firstfsb parameter to _bmapi_write so that it doesn't incorrectly treat stack garbage as a restriction on which AGs it can search for free space. Fixes-coverity-id: 1402025 Fixes-coverity-id: 1415167 Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
| * | | xfs: check _btree_check_block valueDarrick J. Wong2017-07-201-2/+4
| | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | Check the _btree_check_block return value for the firstrec and lastrec functions, since we have the ability to signal that the repositioning did not succeed. Fixes-coverity-id: 114067 Fixes-coverity-id: 114068 Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
* | | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2017-07-2816-49/+99
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull KVM fixes from Paolo Bonzini: "s390: - SRCU fix PPC: - host crash fixes x86: - bugfixes, including making nested posted interrupts really work Generic: - tweaks to kvm_stat and to uevents" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: LAPIC: Fix reentrancy issues with preempt notifiers tools/kvm_stat: add '-f help' to get the available event list tools/kvm_stat: use variables instead of hard paths in help output KVM: nVMX: Fix loss of L2's NMI blocking state KVM: nVMX: Fix posted intr delivery when vcpu is in guest mode x86: irq: Define a global vector for nested posted interrupts KVM: x86: do mask out upper bits of PAE CR3 KVM: make pid available for uevents without debugfs KVM: s390: take srcu lock when getting/setting storage keys KVM: VMX: remove unused field KVM: PPC: Book3S HV: Fix host crash on changing HPT size KVM: PPC: Book3S HV: Enable TM before accessing TM registers
| * \ \ Merge branch 'kvm-ppc-fixes' of ↵Paolo Bonzini2017-07-262-1/+5
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into kvm-master Two commits which fix host crashes. Signed-off-by: Paolo BOnzini <pbonzini@redhat.com>
| | * | | KVM: PPC: Book3S HV: Fix host crash on changing HPT sizePaul Mackerras2017-07-241-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit f98a8bf9ee20 ("KVM: PPC: Book3S HV: Allow KVM_PPC_ALLOCATE_HTAB ioctl() to change HPT size", 2016-12-20) changed the behaviour of the KVM_PPC_ALLOCATE_HTAB ioctl so that it now allocates a new HPT and new revmap array if there was a previously-allocated HPT of a different size from the size being requested. In this case, we need to reset the rmap arrays of the memslots, because the rmap arrays will contain references to HPTEs which are no longer valid. Worse, these references are also references to slots in the new revmap array (which parallels the HPT), and the new revmap array contains random contents, since it doesn't get zeroed on allocation. The effect of having these stale references to slots in the revmap array that contain random contents is that subsequent calls to functions such as kvmppc_add_revmap_chain will crash because they will interpret the non-zero contents of the revmap array as HPTE indexes and thus index outside of the revmap array. This leads to host crashes such as the following. [ 7072.862122] Unable to handle kernel paging request for data at address 0xd000000c250c00f8 [ 7072.862218] Faulting instruction address: 0xc0000000000e1c78 [ 7072.862233] Oops: Kernel access of bad area, sig: 11 [#1] [ 7072.862286] SMP NR_CPUS=1024 [ 7072.862286] NUMA [ 7072.862325] PowerNV [ 7072.862378] Modules linked in: kvm_hv vhost_net vhost tap xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm iw_cxgb3 mlx5_ib ib_core ses enclosure scsi_transport_sas ipmi_powernv ipmi_devintf ipmi_msghandler powernv_op_panel i2c_opal nfsd auth_rpcgss oid_registry [ 7072.863085] nfs_acl lockd grace sunrpc kvm_pr kvm xfs libcrc32c scsi_dh_alua dm_service_time radeon lpfc nvme_fc nvme_fabrics nvme_core scsi_transport_fc i2c_algo_bit tg3 drm_kms_helper ptp pps_core syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm dm_multipath i2c_core cxgb3 mlx5_core mdio [last unloaded: kvm_hv] [ 7072.863381] CPU: 72 PID: 56929 Comm: qemu-system-ppc Not tainted 4.12.0-kvm+ #59 [ 7072.863457] task: c000000fe29e7600 task.stack: c000001e3ffec000 [ 7072.863520] NIP: c0000000000e1c78 LR: c0000000000e2e3c CTR: c0000000000e25f0 [ 7072.863596] REGS: c000001e3ffef560 TRAP: 0300 Not tainted (4.12.0-kvm+) [ 7072.863658] MSR: 9000000100009033 <SF,HV,EE,ME,IR,DR,RI,LE,TM[E]> [ 7072.863667] CR: 44082882 XER: 20000000 [ 7072.863767] CFAR: c0000000000e2e38 DAR: d000000c250c00f8 DSISR: 42000000 SOFTE: 1 GPR00: c0000000000e2e3c c000001e3ffef7e0 c000000001407d00 d000000c250c00f0 GPR04: d00000006509fb70 d00000000b3d2048 0000000003ffdfb7 0000000000000000 GPR08: 00000001007fdfb7 00000000c000000f d0000000250c0000 000000000070f7bf GPR12: 0000000000000008 c00000000fdad000 0000000010879478 00000000105a0d78 GPR16: 00007ffaf4080000 0000000000001190 0000000000000000 0000000000010000 GPR20: 4001ffffff000415 d00000006509fb70 0000000004091190 0000000ee1881190 GPR24: 0000000003ffdfb7 0000000003ffdfb7 00000000007fdfb7 c000000f5c958000 GPR28: d00000002d09fb70 0000000003ffdfb7 d00000006509fb70 d00000000b3d2048 [ 7072.864439] NIP [c0000000000e1c78] kvmppc_add_revmap_chain+0x88/0x130 [ 7072.864503] LR [c0000000000e2e3c] kvmppc_do_h_enter+0x84c/0x9e0 [ 7072.864566] Call Trace: [ 7072.864594] [c000001e3ffef7e0] [c000001e3ffef830] 0xc000001e3ffef830 (unreliable) [ 7072.864671] [c000001e3ffef830] [c0000000000e2e3c] kvmppc_do_h_enter+0x84c/0x9e0 [ 7072.864751] [c000001e3ffef920] [d00000000b38d878] kvmppc_map_vrma+0x168/0x200 [kvm_hv] [ 7072.864831] [c000001e3ffef9e0] [d00000000b38a684] kvmppc_vcpu_run_hv+0x1284/0x1300 [kvm_hv] [ 7072.864914] [c000001e3ffefb30] [d00000000f465664] kvmppc_vcpu_run+0x44/0x60 [kvm] [ 7072.865008] [c000001e3ffefb60] [d00000000f461864] kvm_arch_vcpu_ioctl_run+0x114/0x290 [kvm] [ 7072.865152] [c000001e3ffefbe0] [d00000000f453c98] kvm_vcpu_ioctl+0x598/0x7a0 [kvm] [ 7072.865292] [c000001e3ffefd40] [c000000000389328] do_vfs_ioctl+0xd8/0x8c0 [ 7072.865410] [c000001e3ffefde0] [c000000000389be4] SyS_ioctl+0xd4/0x130 [ 7072.865526] [c000001e3ffefe30] [c00000000000b760] system_call+0x58/0x6c [ 7072.865644] Instruction dump: [ 7072.865715] e95b2110 793a0020 7b4926e4 7f8a4a14 409e0098 807c000c 786326e4 7c6a1a14 [ 7072.865857] 935e0008 7bbd0020 813c000c 913e000c <93a30008> 93bc000c 48000038 60000000 [ 7072.866001] ---[ end trace 627b6e4bf8080edc ]--- Note that to trigger this, it is necessary to use a recent upstream QEMU (or other userspace that resizes the HPT at CAS time), specify a maximum memory size substantially larger than the current memory size, and boot a guest kernel that does not support HPT resizing. This fixes the problem by resetting the rmap arrays when the old HPT is freed. Fixes: f98a8bf9ee20 ("KVM: PPC: Book3S HV: Allow KVM_PPC_ALLOCATE_HTAB ioctl() to change HPT size") Cc: stable@vger.kernel.org # v4.11+ Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>