summaryrefslogtreecommitdiffstats
path: root/fs (follow)
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'locks-v5.5-1' of ↵Linus Torvalds2019-12-291-1/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux Pull /proc/locks formatting fix from Jeff Layton: "This is a trivial fix for a _very_ long standing bug in /proc/locks formatting. Ordinarily, I'd wait for the merge window for something like this, but it is making it difficult to validate some overlayfs fixes. I've also gone ahead and marked this for stable" * tag 'locks-v5.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux: locks: print unsigned ino in /proc/locks
| * locks: print unsigned ino in /proc/locksAmir Goldstein2019-12-291-1/+1
| | | | | | | | | | | | | | | | An ino is unsigned, so display it as such in /proc/locks. Cc: stable@vger.kernel.org Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
* | Merge tag '5.5-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds2019-12-293-10/+56
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull cifs fixes from Steve French: "One performance fix for large directory searches, and one minor style cleanup noticed by Clang" * tag '5.5-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: Optimize readdir on reparse points cifs: Adjust indentation in smb2_open_file
| * | cifs: Optimize readdir on reparse pointsPaulo Alcantara (SUSE)2019-12-232-9/+55
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When listing a directory with thounsands of files and most of them are reparse points, we simply marked all those dentries for revalidation and then sending additional (compounded) create/getinfo/close requests for each of them. Instead, upon receiving a response from an SMB2_QUERY_DIRECTORY (FileIdFullDirectoryInformation) command, the directory entries that have a file attribute of FILE_ATTRIBUTE_REPARSE_POINT will contain an EaSize field with a reparse tag in it, so we parse it and mark the dentry for revalidation only if it is a DFS or a symlink. Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
| * | cifs: Adjust indentation in smb2_open_fileNathan Chancellor2019-12-231-1/+1
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Clang warns: ../fs/cifs/smb2file.c:70:3: warning: misleading indentation; statement is not part of the previous 'if' [-Wmisleading-indentation] if (oparms->tcon->use_resilient) { ^ ../fs/cifs/smb2file.c:66:2: note: previous statement is here if (rc) ^ 1 warning generated. This warning occurs because there is a space after the tab on this line. Remove it so that the indentation is consistent with the Linux kernel coding style and clang no longer warns. Fixes: 592fafe644bf ("Add resilienthandles mount parm") Link: https://github.com/ClangBuiltLinux/linux/issues/826 Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: Steve French <stfrench@microsoft.com>
* | Merge tag 'io_uring-5.5-20191226' of git://git.kernel.dk/linux-blockLinus Torvalds2019-12-272-343/+357
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull io_uring fixes from Jens Axboe: - Removal of now unused busy wqe list (Hillf) - Add cond_resched() to io-wq work processing (Hillf) - And then the series that I hinted at from last week, which removes the sqe from the io_kiocb and keeps all sqe handling on the prep side. This guarantees that an opcode can't do the wrong thing and read the sqe more than once. This is unchanged from last week, no issues have been observed with this in testing. Hence I really think we should fold this into 5.5. * tag 'io_uring-5.5-20191226' of git://git.kernel.dk/linux-block: io-wq: add cond_resched() to worker thread io-wq: remove unused busy list from io_sqe io_uring: pass in 'sqe' to the prep handlers io_uring: standardize the prep methods io_uring: read 'count' for IORING_OP_TIMEOUT in prep handler io_uring: move all prep state for IORING_OP_{SEND,RECV}_MGS to prep handler io_uring: move all prep state for IORING_OP_CONNECT to prep handler io_uring: add and use struct io_rw for read/writes io_uring: use u64_to_user_ptr() consistently
| * io-wq: add cond_resched() to worker threadHillf Danton2019-12-241-0/+2
| | | | | | | | | | | | | | | | Reschedule the current IO worker to cut the risk that it is becoming a cpu hog. Signed-off-by: Hillf Danton <hdanton@sina.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io-wq: remove unused busy list from io_sqeHillf Danton2019-12-231-8/+0
| | | | | | | | | | | | | | | | | | | | Commit e61df66c69b1 ("io-wq: ensure free/busy list browsing see all items") added a list for io workers in addition to the free and busy lists, not only making worker walk cleaner, but leaving the busy list unused. Let's remove it. Signed-off-by: Hillf Danton <hdanton@sina.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: pass in 'sqe' to the prep handlersJens Axboe2019-12-201-242/+251
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This moves the prep handlers outside of the opcode handlers, and allows us to pass in the sqe directly. If the sqe is non-NULL, it means that the request should be prepared for the first time. With the opcode handlers not having access to the sqe at all, we are guaranteed that the prep handler has setup the request fully by the time we get there. As before, for opcodes that need to copy in more data then the io_kiocb allows for, the io_async_ctx holds that info. If a prep handler is invoked with req->io set, it must use that to retain information for later. Finally, we can remove io_kiocb->sqe as well. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: standardize the prep methodsJens Axboe2019-12-201-65/+63
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We currently have a mix of use cases. Most of the newer ones are pretty uniform, but we have some older ones that use different calling calling conventions. This is confusing. For the opcodes that currently rely on the req->io->sqe copy saving them from reuse, add a request type struct in the io_kiocb command union to store the data they need. Prepare for all opcodes having a standard prep method, so we can call it in a uniform fashion and outside of the opcode handler. This is in preparation for passing in the 'sqe' pointer, rather than storing it in the io_kiocb. Once we have uniform prep handlers, we can leave all the prep work to that part, and not even pass in the sqe to the opcode handler. This ensures that we don't reuse sqe data inadvertently. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: read 'count' for IORING_OP_TIMEOUT in prep handlerJens Axboe2019-12-201-3/+8
| | | | | | | | | | | | | | | | Add the count field to struct io_timeout, and ensure the prep handler has read it. Timeout also needs an async context always, set it up in the prep handler if we don't have one. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: move all prep state for IORING_OP_{SEND,RECV}_MGS to prep handlerJens Axboe2019-12-201-31/+33
| | | | | | | | | | | | | | | | Add struct io_sr_msg in our io_kiocb per-command union, and ensure that the send/recvmsg prep handlers have grabbed what they need from the SQE by the time prep is done. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: move all prep state for IORING_OP_CONNECT to prep handlerJens Axboe2019-12-201-18/+22
| | | | | | | | | | | | | | Add struct io_connect in our io_kiocb per-command union, and ensure that io_connect_prep() has grabbed what it needs from the SQE. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: add and use struct io_rw for read/writesJens Axboe2019-12-201-46/+50
| | | | | | | | | | | | | | | | | | | | | | Put the kiocb in struct io_rw, and add the addr/len for the request as well. Use the kiocb->private field for the buffer index for fixed reads and writes. Any use of kiocb->ki_filp is flipped to req->file. It's the same thing, and less confusing. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: use u64_to_user_ptr() consistentlyJens Axboe2019-12-201-9/+7
| | | | | | | | | | | | | | | | | | We use it in some spots, but not consistently. Convert the rest over, makes it easier to read as well. No functional changes in this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds2019-12-235-2/+16
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull vfs fixes from Al Viro: "Eric's s_inodes softlockup fixes + Jan's fix for recent regression from pipe rework" * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: call fsnotify_sb_delete after evict_inodes fs: avoid softlockups in s_inodes iterators pipe: Fix bogus dereference in iov_iter_alignment()
| * | fs: call fsnotify_sb_delete after evict_inodesEric Sandeen2019-12-182-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a filesystem is unmounted, we currently call fsnotify_sb_delete() before evict_inodes(), which means that fsnotify_unmount_inodes() must iterate over all inodes on the superblock looking for any inodes with watches. This is inefficient and can lead to livelocks as it iterates over many unwatched inodes. At this point, SB_ACTIVE is gone and dropping refcount to zero kicks the inode out out immediately, so anything processed by fsnotify_sb_delete / fsnotify_unmount_inodes gets evicted in that loop. After that, the call to evict_inodes will evict everything else with a zero refcount. This should speed things up overall, and avoid livelocks in fsnotify_unmount_inodes(). Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | fs: avoid softlockups in s_inodes iteratorsEric Sandeen2019-12-184-1/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Anything that walks all inodes on sb->s_inodes list without rescheduling risks softlockups. Previous efforts were made in 2 functions, see: c27d82f fs/drop_caches.c: avoid softlockups in drop_pagecache_sb() ac05fbb inode: don't softlockup when evicting inodes but there hasn't been an audit of all walkers, so do that now. This also consistently moves the cond_resched() calls to the bottom of each loop in cases where it already exists. One loop remains: remove_dquot_ref(), because I'm not quite sure how to deal with that one w/o taking the i_lock. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | | Merge tag 'xfs-5.5-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds2019-12-2212-103/+340
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull xfs fixes from Darrick Wong: "Fix a few bugs that could lead to corrupt files, fsck complaints, and filesystem crashes: - Minor documentation fixes - Fix a file corruption due to read racing with an insert range operation. - Fix log reservation overflows when allocating large rt extents - Fix a buffer log item flags check - Don't allow administrators to mount with sunit= options that will cause later xfs_repair complaints about the root directory being suspicious because the fs geometry appeared inconsistent - Fix a non-static helper that should have been static" * tag 'xfs-5.5-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: Make the symbol 'xfs_rtalloc_log_count' static xfs: don't commit sunit/swidth updates to disk if that would cause repair failures xfs: split the sunit parameter update into two parts xfs: refactor agfl length computation function libxfs: resync with the userspace libxfs xfs: use bitops interface for buf log item AIL flag check xfs: fix log reservation overflows when allocating large rt extents xfs: stabilize insert range start boundary to avoid COW writeback race xfs: fix Sphinx documentation warning
| * | | xfs: Make the symbol 'xfs_rtalloc_log_count' staticChen Wandun2019-12-201-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix the following sparse warning: fs/xfs/libxfs/xfs_trans_resv.c:206:1: warning: symbol 'xfs_rtalloc_log_count' was not declared. Should it be static? Fixes: b1de6fc7520f ("xfs: fix log reservation overflows when allocating large rt extents") Signed-off-by: Chen Wandun <chenwandun@huawei.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * | | xfs: don't commit sunit/swidth updates to disk if that would cause repair ↵Darrick J. Wong2019-12-194-1/+130
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | failures Alex Lyakas reported[1] that mounting an xfs filesystem with new sunit and swidth values could cause xfs_repair to fail loudly. The problem here is that repair calculates the where mkfs should have allocated the root inode, based on the superblock geometry. The allocation decisions depend on sunit, which means that we really can't go updating sunit if it would lead to a subsequent repair failure on an otherwise correct filesystem. Port from xfs_repair some code that computes the location of the root inode and teach mount to skip the ondisk update if it would cause problems for repair. Along the way we'll update the documentation, provide a function for computing the minimum AGFL size instead of open-coding it, and cut down some indenting in the mount code. Note that we allow the mount to proceed (and new allocations will reflect this new geometry) because we've never screened this kind of thing before. We'll have to wait for a new future incompat feature to enforce correct behavior, alas. Note that the geometry reporting always uses the superblock values, not the incore ones, so that is what xfs_info and xfs_growfs will report. [1] https://lore.kernel.org/linux-xfs/20191125130744.GA44777@bfoster/T/#m00f9594b511e076e2fcdd489d78bc30216d72a7d Reported-by: Alex Lyakas <alex@zadara.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
| * | | xfs: split the sunit parameter update into two partsDarrick J. Wong2019-12-191-51/+72
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the administrator provided a sunit= mount option, we need to validate the raw parameter, convert the mount option units (512b blocks) into the internal unit (fs blocks), and then validate that the (now cooked) parameter doesn't screw anything up on disk. The incore inode geometry computation can depend on the new sunit option, but a subsequent patch will make validating the cooked value depends on the computed inode geometry, so break the sunit update into two steps. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
| * | | xfs: refactor agfl length computation functionDarrick J. Wong2019-12-191-5/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Refactor xfs_alloc_min_freelist to accept a NULL @pag argument, in which case it returns the largest possible minimum length. This will be used in an upcoming patch to compute the length of the AGFL at mkfs time. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
| * | | libxfs: resync with the userspace libxfsDarrick J. Wong2019-12-194-24/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Prepare to resync the userspace libxfs with the kernel libxfs. There were a few things I missed -- a couple of static inline directory functions that have to be exported for xfs_repair; a couple of directory naming functions that make porting much easier if they're /not/ static inline; and a u16 usage that should have been uint16_t. None of these things are bugs in their own right; this just makes porting xfsprogs easier. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
| * | | xfs: use bitops interface for buf log item AIL flag checkBrian Foster2019-12-191-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The xfs_log_item flags were converted to atomic bitops as of commit 22525c17ed ("xfs: log item flags are racy"). The assert check for AIL presence in xfs_buf_item_relse() still uses the old value based check. This likely went unnoticed as XFS_LI_IN_AIL evaluates to 0 and causes the assert to unconditionally pass. Fix up the check. Signed-off-by: Brian Foster <bfoster@redhat.com> Fixes: 22525c17ed ("xfs: log item flags are racy") Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * | | xfs: fix log reservation overflows when allocating large rt extentsDarrick J. Wong2019-12-171-19/+77
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Omar Sandoval reported that a 4G fallocate on the realtime device causes filesystem shutdowns due to a log reservation overflow that happens when we log the rtbitmap updates. Factor rtbitmap/rtsummary updates into the the tr_write and tr_itruncate log reservation calculation. "The following reproducer results in a transaction log overrun warning for me: mkfs.xfs -f -r rtdev=/dev/vdc -d rtinherit=1 -m reflink=0 /dev/vdb mount -o rtdev=/dev/vdc /dev/vdb /mnt fallocate -l 4G /mnt/foo Reported-by: Omar Sandoval <osandov@osandov.com> Tested-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
| * | | xfs: stabilize insert range start boundary to avoid COW writeback raceBrian Foster2019-12-112-2/+13
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | generic/522 (fsx) occasionally fails with a file corruption due to an insert range operation. The primary characteristic of the corruption is a misplaced insert range operation that differs from the requested target offset. The reason for this behavior is a race between the extent shift sequence of an insert range and a COW writeback completion that causes a front merge with the first extent in the shift. The shift preparation function flushes and unmaps from the target offset of the operation to the end of the file to ensure no modifications can be made and page cache is invalidated before file data is shifted. An insert range operation then splits the extent at the target offset, if necessary, and begins to shift the start offset of each extent starting from the end of the file to the start offset. The shift sequence operates at extent level and so depends on the preparation sequence to guarantee no changes can be made to the target range during the shift. If the block immediately prior to the target offset was dirty and shared, however, it can undergo writeback and move from the COW fork to the data fork at any point during the shift. If the block is contiguous with the block at the start offset of the insert range, it can front merge and alter the start offset of the extent. Once the shift sequence reaches the target offset, it shifts based on the latest start offset and silently changes the target offset of the operation and corrupts the file. To address this problem, update the shift preparation code to stabilize the start boundary along with the full range of the insert. Also update the existing corruption check to fail if any extent is shifted with a start offset behind the target offset of the insert range. This prevents insert from racing with COW writeback completion and fails loudly in the event of an unexpected extent shift. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* | | Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds2019-12-226-95/+104
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 bug fixes from Ted Ts'o: "Ext4 bug fixes, including a regression fix" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: clarify impact of 'commit' mount option ext4: fix unused-but-set-variable warning in ext4_add_entry() jbd2: fix kernel-doc notation warning ext4: use RCU API in debug_print_tree ext4: validate the debug_want_extra_isize mount option at parse time ext4: reserve revoke credits in __ext4_new_inode ext4: unlock on error in ext4_expand_extra_isize() ext4: optimize __ext4_check_dir_entry() ext4: check for directory entries too close to block end ext4: fix ext4_empty_dir() for directories with holes
| * | | ext4: fix unused-but-set-variable warning in ext4_add_entry()Yunfeng Ye2019-12-221-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Warning is found when compile with "-Wunused-but-set-variable": fs/ext4/namei.c: In function ‘ext4_add_entry’: fs/ext4/namei.c:2167:23: warning: variable ‘sbi’ set but not used [-Wunused-but-set-variable] struct ext4_sb_info *sbi; ^~~ Fix this by moving the variable @sbi under CONFIG_UNICODE. Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/cb5eb904-224a-9701-c38f-cb23514b1fff@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | ext4: use RCU API in debug_print_treePhong Tran2019-12-161-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | struct ext4_sb_info.system_blks was marked __rcu. But access the pointer without using RCU lock and dereference. Sparse warning with __rcu notation: block_validity.c:139:29: warning: incorrect type in argument 1 (different address spaces) block_validity.c:139:29: expected struct rb_root const * block_validity.c:139:29: got struct rb_root [noderef] <asn:4> * Link: https://lore.kernel.org/r/20191213153306.30744-1-tranmanphong@gmail.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Phong Tran <tranmanphong@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | ext4: validate the debug_want_extra_isize mount option at parse timeTheodore Ts'o2019-12-161-74/+69
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of setting s_want_extra_size and then making sure that it is a valid value afterwards, validate the field before we set it. This avoids races and other problems when remounting the file system. Link: https://lore.kernel.org/r/20191215063020.GA11512@mit.edu Cc: stable@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reported-and-tested-by: syzbot+4a39a025912b265cacef@syzkaller.appspotmail.com
| * | | ext4: reserve revoke credits in __ext4_new_inodeyangerkun2019-12-141-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It's possible that __ext4_new_inode will release the xattr block, so it will trigger a warning since there is revoke credits will be 0 if the handle == NULL. The below scripts can reproduce it easily. ------------[ cut here ]------------ WARNING: CPU: 0 PID: 3861 at fs/jbd2/revoke.c:374 jbd2_journal_revoke+0x30e/0x540 fs/jbd2/revoke.c:374 ... __ext4_forget+0x1d7/0x800 fs/ext4/ext4_jbd2.c:248 ext4_free_blocks+0x213/0x1d60 fs/ext4/mballoc.c:4743 ext4_xattr_release_block+0x55b/0x780 fs/ext4/xattr.c:1254 ext4_xattr_block_set+0x1c2c/0x2c40 fs/ext4/xattr.c:2112 ext4_xattr_set_handle+0xa7e/0x1090 fs/ext4/xattr.c:2384 __ext4_set_acl+0x54d/0x6c0 fs/ext4/acl.c:214 ext4_init_acl+0x218/0x2e0 fs/ext4/acl.c:293 __ext4_new_inode+0x352a/0x42b0 fs/ext4/ialloc.c:1151 ext4_mkdir+0x2e9/0xbd0 fs/ext4/namei.c:2774 vfs_mkdir+0x386/0x5f0 fs/namei.c:3811 do_mkdirat+0x11c/0x210 fs/namei.c:3834 do_syscall_64+0xa1/0x530 arch/x86/entry/common.c:294 ... ------------------------------------- scripts: mkfs.ext4 /dev/vdb mount /dev/vdb /mnt cd /mnt && mkdir dir && for i in {1..8}; do setfacl -dm "u:user_"$i":rx" dir; done mkdir dir/dir1 && mv dir/dir1 ./ sh repro.sh && add some user [root@localhost ~]# cat repro.sh while [ 1 -eq 1 ]; do rm -rf dir rm -rf dir1/dir1 mkdir dir for i in {1..8}; do setfacl -dm "u:test"$i":rx" dir; done setfacl -m "u:user_9:rx" dir & mkdir dir1/dir1 & done Before exec repro.sh, dir1 has inherit the default acl from dir, and xattr block of dir1 dir is not the same, so the h_refcount of these two dir's xattr block will be 1. Then repro.sh can trigger the warning with the situation show as below. The last h_refcount can be clear with mkdir, and __ext4_new_inode has not reserved revoke credits, so the warning will happened, fix it by reserve revoke credits in __ext4_new_inode. Thread 1 Thread 2 mkdir dir set default acl(will create a xattr block blk1 and the refcount of ext4_xattr_header will be 1) ... mkdir dir1/dir1 ->....->ext4_init_acl ->__ext4_set_acl(set default acl, will reuse blk1, and h_refcount will be 2) setfacl->ext4_set_acl->... ->ext4_xattr_block_set(will create new block blk2 to store xattr) ->__ext4_set_acl(set access acl, since h_refcount of blk1 is 2, will create blk3 to store xattr) ->ext4_xattr_release_block(dec h_refcount of blk1 to 1) ->ext4_xattr_release_block(dec h_refcount and since it is 0, will release the block and trigger the warning) Link: https://lore.kernel.org/r/20191213014900.47228-1-yangerkun@huawei.com Reported-by: Hulk Robot <hulkci@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: yangerkun <yangerkun@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | ext4: unlock on error in ext4_expand_extra_isize()Dan Carpenter2019-12-141-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We need to unlock the xattr before returning on this error path. Cc: stable@kernel.org # 4.13 Fixes: c03b45b853f5 ("ext4, project: expand inode extra size if possible") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Link: https://lore.kernel.org/r/20191213185010.6k7yl2tck3wlsdkt@kili.mountain Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | ext4: optimize __ext4_check_dir_entry()Theodore Ts'o2019-12-141-5/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make __ext4_check_dir_entry() a bit easier to understand, and reduce the object size of the function by over 11%. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20191209004346.38526-1-tytso@mit.edu Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | ext4: check for directory entries too close to block endJan Kara2019-12-141-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ext4_check_dir_entry() currently does not catch a case when a directory entry ends so close to the block end that the header of the next directory entry would not fit in the remaining space. This can lead to directory iteration code trying to access address beyond end of current buffer head leading to oops. CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191202170213.4761-3-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | ext4: fix ext4_empty_dir() for directories with holesJan Kara2019-12-141-14/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Function ext4_empty_dir() doesn't correctly handle directories with holes and crashes on bh->b_data dereference when bh is NULL. Reorganize the loop to use 'offset' variable all the times instead of comparing pointers to current direntry with bh->b_data pointer. Also add more strict checking of '.' and '..' directory entries to avoid entering loop in possibly invalid state on corrupted filesystems. References: CVE-2019-19037 CC: stable@vger.kernel.org Fixes: 4e19d6b65fb4 ("ext4: allow directory holes") Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191202170213.4761-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | | | pipe: fix empty pipe check in pipe_write()Jan Stancek2019-12-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | LTP pipeio_1 test is hanging with v5.5-rc2-385-gb8e382a185eb, with read side observing empty pipe and sleeping and write side running out of space and then sleeping as well. In this scenario there are 5 writers and 1 reader. Problem is that after pipe_write() reacquires pipe lock, it re-checks for empty pipe with potentially stale 'head' and doesn't wake up read side anymore. pipe->tail can advance beyond 'head', because there are multiple writers. Use pipe->head for empty pipe check after reacquiring lock to observe current state. Testing: With patch, LTP pipeio_1 ran successfully in loop for 1 hour. Without patch it hanged within a minute. Fixes: 1b6b26ae7053 ("pipe: fix and clarify pipe write wakeup logic") Reported-by: Rachel Sibley <rasibley@redhat.com> Signed-off-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | Merge tag 'io_uring-5.5-20191220' of git://git.kernel.dk/linux-blockLinus Torvalds2019-12-203-229/+493
|\ \ \ \ | | |_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull io_uring fixes from Jens Axboe: "Here's a set of fixes that should go into 5.5-rc3 for io_uring. This is bigger than I'd like it to be, mainly because we're fixing the case where an application reuses sqe data right after issue. This really must work, or it's confusing. With 5.5 we're flagging us as submit stable for the actual data, this must also be the case for SQEs. Honestly, I'd really like to add another series on top of this, since it cleans it up considerable and prevents any SQE reuse by design. I posted that here: https://lore.kernel.org/io-uring/20191220174742.7449-1-axboe@kernel.dk/T/#u and may still send it your way early next week once it's been looked at and had some more soak time (does pass all regression tests). With that series, we've unified the prep+issue handling, and only the prep phase even has access to the SQE. Anyway, outside of that, fixes in here for a few other issues that have been hit in testing or production" * tag 'io_uring-5.5-20191220' of git://git.kernel.dk/linux-block: io_uring: io_wq_submit_work() should not touch req->rw io_uring: don't wait when under-submitting io_uring: warn about unhandled opcode io_uring: read opcode and user_data from SQE exactly once io_uring: make IORING_OP_TIMEOUT_REMOVE deferrable io_uring: make IORING_OP_CANCEL_ASYNC deferrable io_uring: make IORING_POLL_ADD and IORING_POLL_REMOVE deferrable io_uring: make HARDLINK imply LINK io_uring: any deferred command must have stable sqe data io_uring: remove 'sqe' parameter to the OP helpers that take it io_uring: fix pre-prepped issue with force_nonblock == true io-wq: re-add io_wq_current_is_worker() io_uring: fix sporadic -EFAULT from IORING_OP_RECVMSG io_uring: fix stale comment and a few typos
| * | | io_uring: io_wq_submit_work() should not touch req->rwJens Axboe2019-12-181-3/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I've been chasing a weird and obscure crash that was userspace stack corruption, and finally narrowed it down to a bit flip that made a stack address invalid. io_wq_submit_work() unconditionally flips the req->rw.ki_flags IOCB_NOWAIT bit, but since it's a generic work handler, this isn't valid. Normal read/write operations own that part of the request, on other types it could be something else. Move the IOCB_NOWAIT clear to the read/write handlers where it belongs. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: don't wait when under-submittingPavel Begunkov2019-12-181-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is no reliable way to submit and wait in a single syscall, as io_submit_sqes() may under-consume sqes (in case of an early error). Then it will wait for not-yet-submitted requests, deadlocking the user in most cases. Don't wait/poll if can't submit all sqes Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: warn about unhandled opcodeJens Axboe2019-12-181-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that we have all the opcodes handled in terms of command prep and SQE reuse, add a printk_once() to warn about any potentially new and unhandled ones. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: read opcode and user_data from SQE exactly onceJens Axboe2019-12-181-25/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we defer a request, we can't be reading the opcode again. Ensure that the user_data and opcode fields are stable. For the user_data we already have a place for it, for the opcode we can fill a one byte hold and store that as well. For both of them, assign them when we originally read the SQE in io_get_sqring(). Any code that uses sqe->opcode or sqe->user_data is switched to req->opcode and req->user_data. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: make IORING_OP_TIMEOUT_REMOVE deferrableJens Axboe2019-12-181-10/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we defer this command as part of a link, we have to make sure that the SQE data has been read upfront. Integrate the timeout remove op into the prep handling to make it safe for SQE reuse. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: make IORING_OP_CANCEL_ASYNC deferrableJens Axboe2019-12-181-4/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we defer this command as part of a link, we have to make sure that the SQE data has been read upfront. Integrate the async cancel op into the prep handling to make it safe for SQE reuse. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: make IORING_POLL_ADD and IORING_POLL_REMOVE deferrableJens Axboe2019-12-181-14/+54
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we defer these commands as part of a link, we have to make sure that the SQE data has been read upfront. Integrate the poll add/remove into the prep handling to make it safe for SQE reuse. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: make HARDLINK imply LINKPavel Begunkov2019-12-181-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The rules are as follows, if IOSQE_IO_HARDLINK is specified, then it's a link and there is no need to set IOSQE_IO_LINK separately, though it could be there. Add proper check and ensure that IOSQE_IO_HARDLINK implies IOSQE_IO_LINK. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: any deferred command must have stable sqe dataJens Axboe2019-12-181-49/+172
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We're currently not retaining sqe data for accept, fsync, and sync_file_range. None of these commands need data outside of what is directly provided, hence it can't go stale when the request is deferred. However, it can get reused, if an application reuses SQE entries. Ensure that we retain the information we need and only read the sqe contents once, off the submission path. Most of this is just moving code into a prep and finish function. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: remove 'sqe' parameter to the OP helpers that take itJens Axboe2019-12-181-36/+44
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We pass in req->sqe for all of them, no need to pass it in as the request is always passed in. This is a necessary prep patch to be able to cleanup/fix the request prep path. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: fix pre-prepped issue with force_nonblock == trueJens Axboe2019-12-181-77/+98
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some of these code paths assume that any force_nonblock == true issue is not prepped, but that's not true if we did prep as part of link setup earlier. Check if we already have an async context allocate before setting up a new one. Cleanup the async context setup in general, we have a lot of duplicated code there. Fixes: 03b1230ca12a ("io_uring: ensure async punted sendmsg/recvmsg requests copy data") Fixes: f67676d160c6 ("io_uring: ensure async punted read/write requests copy iovec") Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io-wq: re-add io_wq_current_is_worker()Jens Axboe2019-12-181-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 8cdda87a4414, we now have several use csaes for this helper. Reinstate it. Signed-off-by: Jens Axboe <axboe@kernel.dk>