summaryrefslogtreecommitdiffstats
path: root/fs (follow)
Commit message (Collapse)AuthorAgeFilesLines
* ext4: inline jbd2_journal_[un]register_shrinker()Theodore Ts'o2021-07-083-99/+62
| | | | | | | | | | | | | | | | | | | | | | | | | | The function jbd2_journal_unregister_shrinker() was getting called twice when the file system was getting unmounted. On Power and ARM platforms this was causing kernel crash when unmounting the file system, when a percpu_counter was destroyed twice. Fix this by removing jbd2_journal_[un]register_shrinker() functions, and inlining the shrinker setup and teardown into journal_init_common() and jbd2_journal_destroy(). This means that ext4 and ocfs2 now no longer need to know about registering and unregistering jbd2's shrinker. Also, while we're at it, rename the percpu counter from j_jh_shrink_count to j_checkpoint_jh_count, since this makes it clearer what this counter is intended to track. Link: https://lore.kernel.org/r/20210705145025.3363130-1-tytso@mit.edu Fixes: 4ba3fcdde7e3 ("jbd2,ext4: add a shrinker to release checkpointed buffers") Reported-by: Jon Hunter <jonathanh@nvidia.com> Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com> Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com> Tested-by: Jon Hunter <jonathanh@nvidia.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: fix flags validity checking for EXT4_IOC_CHECKPOINTTheodore Ts'o2021-07-081-1/+1
| | | | | | | | | Use the correct bitmask when checking for any not-yet-supported flags. Link: https://lore.kernel.org/r/20210702173425.1276158-1-tytso@mit.edu Fixes: 351a0a3fbc35 ("ext4: add ioctl EXT4_IOC_CHECKPOINT") Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Leah Rumancik <leah.rumancik@gmail.com>
* ext4: fix possible UAF when remounting r/o a mmp-protected file systemTheodore Ts'o2021-07-082-17/+20
| | | | | | | | | | | | | | | | | After commit 618f003199c6 ("ext4: fix memory leak in ext4_fill_super"), after the file system is remounted read-only, there is a race where the kmmpd thread can exit, causing sbi->s_mmp_tsk to point at freed memory, which the call to ext4_stop_mmpd() can trip over. Fix this by only allowing kmmpd() to exit when it is stopped via ext4_stop_mmpd(). Link: https://lore.kernel.org/r/20210707002433.3719773-1-tytso@mit.edu Reported-by: Ye Bin <yebin10@huawei.com> Bug-Report-Link: <20210629143603.2166962-1-yebin10@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
* ext4: use ext4_grp_locked_error in mb_find_extentStephen Brennan2021-07-011-4/+5
| | | | | | | | | | | | | | | | | | Commit 5d1b1b3f492f ("ext4: fix BUG when calling ext4_error with locked block group") introduces ext4_grp_locked_error to handle unlocking a group in error cases. Otherwise, there is a possibility of a sleep while atomic. However, since 43c73221b3b1 ("ext4: replace BUG_ON with WARN_ON in mb_find_extent()"), mb_find_extent() has contained a ext4_error() call while a group spinlock is held. Replace this with ext4_grp_locked_error. Fixes: 43c73221b3b1 ("ext4: replace BUG_ON with WARN_ON in mb_find_extent()") Cc: <stable@vger.kernel.org> # 4.14+ Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Link: https://lore.kernel.org/r/20210623232114.34457-1-stephen.s.brennan@oracle.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: fix WARN_ON_ONCE(!buffer_uptodate) after an error writing the superblockYe Bin2021-07-012-3/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a writeback of the superblock fails with an I/O error, the buffer is marked not uptodate. However, this can cause a WARN_ON to trigger when we attempt to write superblock a second time. (Which might succeed this time, for cerrtain types of block devices such as iSCSI devices over a flaky network.) Try to detect this case in flush_stashed_error_work(), and also change __ext4_handle_dirty_metadata() so we always set the uptodate flag, not just in the nojournal case. Before this commit, this problem can be repliciated via: 1. dmsetup create dust1 --table '0 2097152 dust /dev/sdc 0 4096' 2. mount /dev/mapper/dust1 /home/test 3. dmsetup message dust1 0 addbadblock 0 10 4. cd /home/test 5. echo "XXXXXXX" > t After a few seconds, we got following warning: [ 80.654487] end_buffer_async_write: bh=0xffff88842f18bdd0 [ 80.656134] Buffer I/O error on dev dm-0, logical block 0, lost async page write [ 85.774450] EXT4-fs error (device dm-0): ext4_check_bdev_write_error:193: comm kworker/u16:8: Error while async write back metadata [ 91.415513] mark_buffer_dirty: bh=0xffff88842f18bdd0 [ 91.417038] ------------[ cut here ]------------ [ 91.418450] WARNING: CPU: 1 PID: 1944 at fs/buffer.c:1092 mark_buffer_dirty.cold+0x1c/0x5e [ 91.440322] Call Trace: [ 91.440652] __jbd2_journal_temp_unlink_buffer+0x135/0x220 [ 91.441354] __jbd2_journal_unfile_buffer+0x24/0x90 [ 91.441981] __jbd2_journal_refile_buffer+0x134/0x1d0 [ 91.442628] jbd2_journal_commit_transaction+0x249a/0x3240 [ 91.443336] ? put_prev_entity+0x2a/0x200 [ 91.443856] ? kjournald2+0x12e/0x510 [ 91.444324] kjournald2+0x12e/0x510 [ 91.444773] ? woken_wake_function+0x30/0x30 [ 91.445326] kthread+0x150/0x1b0 [ 91.445739] ? commit_timeout+0x20/0x20 [ 91.446258] ? kthread_flush_worker+0xb0/0xb0 [ 91.446818] ret_from_fork+0x1f/0x30 [ 91.447293] ---[ end trace 66f0b6bf3d1abade ]--- Signed-off-by: Ye Bin <yebin10@huawei.com> Link: https://lore.kernel.org/r/20210615090537.3423231-1-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* Revert "ext4: consolidate checks for resize of bigalloc into ext4_resize_begin"Theodore Ts'o2021-07-012-4/+14
| | | | | | | | | | The function ext4_resize_begin() gets called from three different places, and online resize for bigalloc file systems is disallowed from the old-style online resize (EXT4_IOC_GROUP_ADD and EXT4_IOC_GROUP_EXTEND), but it *is* supposed to be allowed via EXT4_IOC_RESIZE_FS. This reverts commit e9f9f61d0cdcb7f0b0b5feb2d84aa1c5894751f3.
* jbd2: export jbd2_journal_[un]register_shrinker()Zhang Yi2021-06-301-0/+2
| | | | | | | | | | | | | | Export jbd2_journal_[un]register_shrinker() to fix this error when ext4 is built as a module: ERROR: modpost: "jbd2_journal_unregister_shrinker" undefined! ERROR: modpost: "jbd2_journal_register_shrinker" undefined! Fixes: 4ba3fcdde7e3 ("jbd2,ext4: add a shrinker to release checkpointed buffers") Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210630083638.140218-1-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: notify sysfs on errors_count value changeJonathan Davies2021-06-303-0/+8
| | | | | | | | | | | | | | After s_error_count is incremented, signal the change in the corresponding sysfs attribute via sysfs_notify. This allows userspace to poll() on changes to /sys/fs/ext4/*/errors_count. [ Moved call of ext4_notify_error_sysfs() to flush_stashed_error_work() to avoid BUG's caused by calling sysfs_notify trying to sleep after being called from an invalid context. -- TYT ] Signed-off-by: Jonathan Davies <jonathan.davies@nutanix.com> Link: https://lore.kernel.org/r/20210611140209.28903-1-jonathan.davies@nutanix.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* fs: remove bdev_try_to_free_page callbackZhang Yi2021-06-241-15/+0
| | | | | | | | | | | After remove the unique user of sop->bdev_try_to_free_page() callback, we could remove the callback and the corresponding blkdev_releasepage() at all. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210610112440.3438139-9-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: remove bdev_try_to_free_page() callbackZhang Yi2021-06-241-21/+0
| | | | | | | | | | | | | After we introduce a jbd2 shrinker to release checkpointed buffer's journal head, we could free buffer without bdev_try_to_free_page() under memory pressure. So this patch remove the whole bdev_try_to_free_page() callback directly. It also remove many use-after-free issues relate to it together. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210610112440.3438139-8-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* jbd2: simplify journal_clean_one_cp_list()Zhang Yi2021-06-241-26/+4
| | | | | | | | | | | | Now that __try_to_free_cp_buf() remove checkpointed buffer or transaction when the buffer is not 'busy', which is only called by journal_clean_one_cp_list(). This patch simplify this function by remove __try_to_free_cp_buf() and invoke __cp_buffer_busy() directly. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210610112440.3438139-7-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* jbd2,ext4: add a shrinker to release checkpointed buffersZhang Yi2021-06-243-0/+242
| | | | | | | | | | | | | | | | | | | | | | | | | | Current metadata buffer release logic in bdev_try_to_free_page() have a lot of use-after-free issues when umount filesystem concurrently, and it is difficult to fix directly because ext4 is the only user of s_op->bdev_try_to_free_page callback and we may have to add more special refcount or lock that is only used by ext4 into the common vfs layer, which is unacceptable. One better solution is remove the bdev_try_to_free_page callback, but the real problem is we cannot easily release journal_head on the checkpointed buffer, so try_to_free_buffers() cannot release buffers and page under memory pressure, which is more likely to trigger out-of-memory. So we cannot remove the callback directly before we find another way to release journal_head. This patch introduce a shrinker to free journal_head on the checkpointed transaction. After the journal_head got freed, try_to_free_buffers() could free buffer properly. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210610112440.3438139-6-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* jbd2: remove redundant buffer io error checksZhang Yi2021-06-241-11/+2
| | | | | | | | | | | | Now that __jbd2_journal_remove_checkpoint() can detect buffer io error and mark journal checkpoint error, then we abort the journal later before updating log tail to ensure the filesystem works consistently. So we could remove other redundant buffer io error checkes. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210610112440.3438139-5-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* jbd2: don't abort the journal when freeing buffersZhang Yi2021-06-241-17/+0
| | | | | | | | | | | | | Now that we can be sure the journal is aborted once a buffer has failed to be written back to disk, we can remove the journal abort logic in jbd2_journal_try_to_free_buffers() which was introduced in commit c044f3d8360d ("jbd2: abort journal if free a async write error metadata buffer"), because it may cost and propably is not safe. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210610112440.3438139-4-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* jbd2: ensure abort the journal if detect IO error when writing original ↵Zhang Yi2021-06-242-0/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | buffer back Although we merged c044f3d8360 ("jbd2: abort journal if free a async write error metadata buffer"), there is a race between jbd2_journal_try_to_free_buffers() and jbd2_journal_destroy(), so the jbd2_log_do_checkpoint() may still fail to detect the buffer write io error flag which may lead to filesystem inconsistency. jbd2_journal_try_to_free_buffers() ext4_put_super() jbd2_journal_destroy() __jbd2_journal_remove_checkpoint() detect buffer write error jbd2_log_do_checkpoint() jbd2_cleanup_journal_tail() <--- lead to inconsistency jbd2_journal_abort() Fix this issue by introducing a new atomic flag which only have one JBD2_CHECKPOINT_IO_ERROR bit now, and set it in __jbd2_journal_remove_checkpoint() when freeing a checkpoint buffer which has write_io_error flag. Then jbd2_journal_destroy() will detect this mark and abort the journal to prevent updating log tail. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210610112440.3438139-3-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* jbd2: remove the out label in __jbd2_journal_remove_checkpoint()Zhang Yi2021-06-241-12/+12
| | | | | | | | | | | The 'out' lable just return the 'ret' value and seems not required, so remove this label and switch to return appropriate value immediately. This patch also do some minor cleanup, no logical change. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210610112440.3438139-2-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: no need to verify new add extent blockyangerkun2021-06-241-0/+1
| | | | | | | | | | | ext4_ext_grow_indepth will add a new extent block which has init the expected content. We can mark this buffer as verified so to stop a useless check in __read_extent_tree_block. Signed-off-by: yangerkun <yangerkun@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210609075545.1442160-1-yangerkun@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* jbd2: clean up misleading comments for jbd2_fc_release_bufsyangerkun2021-06-241-8/+0
| | | | | | | | | | This comments was for jbd2_fc_wait_bufs, not for jbd2_fc_release_bufs. Remove this misleading comments. Signed-off-by: yangerkun <yangerkun@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210608141236.459441-1-yangerkun@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: add check to prevent attempting to resize an fs with sparse_super2Josh Triplett2021-06-241-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The in-kernel ext4 resize code doesn't support filesystem with the sparse_super2 feature. It fails with errors like this and doesn't finish the resize: EXT4-fs (loop0): resizing filesystem from 16640 to 7864320 blocks EXT4-fs warning (device loop0): verify_reserved_gdb:760: reserved GDT 2 missing grp 1 (32770) EXT4-fs warning (device loop0): ext4_resize_fs:2111: error (-22) occurred during file system resize EXT4-fs (loop0): resized filesystem to 2097152 To reproduce: mkfs.ext4 -b 4096 -I 256 -J size=32 -E resize=$((256*1024*1024)) -O sparse_super2 ext4.img 65M truncate -s 30G ext4.img mount ext4.img /mnt python3 -c 'import fcntl, os, struct ; fd = os.open("/mnt", os.O_RDONLY | os.O_DIRECTORY) ; fcntl.ioctl(fd, 0x40086610, struct.pack("Q", 30 * 1024 * 1024 * 1024 // 4096), False) ; os.close(fd)' dmesg | tail e2fsck ext4.img The userspace resize2fs tool has a check for this case: it checks if the filesystem has sparse_super2 set and if the kernel provides /sys/fs/ext4/features/sparse_super2. However, the former check requires manually reading and parsing the filesystem superblock. Detect this case in ext4_resize_begin and error out early with a clear error message. Signed-off-by: Josh Triplett <josh@joshtriplett.org> Link: https://lore.kernel.org/r/74b8ae78405270211943cd7393e65586c5faeed1.1623093259.git.josh@joshtriplett.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: consolidate checks for resize of bigalloc into ext4_resize_beginJosh Triplett2021-06-242-14/+5
| | | | | | | | | | Two different places checked for attempts to resize a filesystem with the bigalloc feature. Move the check into ext4_resize_begin, which both places already call. Signed-off-by: Josh Triplett <josh@joshtriplett.org> Link: https://lore.kernel.org/r/bee03303d999225ecb3bfa5be8576b2f4c6edbe6.1623093259.git.josh@joshtriplett.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: remove duplicate definition of ext4_xattr_ibody_inline_set()Ritesh Harjani2021-06-243-34/+9
| | | | | | | | | | | | | | ext4_xattr_ibody_inline_set() & ext4_xattr_ibody_set() have the exact same definition. Hence remove ext4_xattr_ibody_inline_set() and all its call references. Convert the callers of it to call ext4_xattr_ibody_set() instead. [ Modified to preserve ext4_xattr_ibody_set() and remove ext4_xattr_ibody_inline_set() instead. -- TYT ] Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/fd566b799bbbbe9b668eb5eecde5b5e319e3694f.1622685482.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: fsmap: fix the block/inode bitmap commentRitesh Harjani2021-06-241-2/+2
| | | | | | | | | | While debugging fstest ext4/027 failure, found below comment to be wrong and confusing. Hence fix it while we are at it. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/e79134132db7ea42f15747b5c669ee91cc1aacdf.1622432690.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: fix comment for s_hash_unsignedEric Biggers2021-06-241-1/+1
| | | | | | | | | Fix the comment for s_hash_unsigned to not be the opposite of what it actually is. Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20210527235557.2377525-1-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: use local variable ei instead of EXT4_I() macrochenyichong2021-06-231-1/+1
| | | | | | | Signed-off-by: chenyichong <chenyichong@uniontech.com> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/20210526052930.11278-1-chenyichong@uniontech.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: fix avefreec in find_group_orlovPan Dong2021-06-231-6/+5
| | | | | | | | | | | The avefreec should be average free clusters instead of average free blocks, otherwize Orlov's allocator will not work properly when bigalloc enabled. Cc: stable@kernel.org Signed-off-by: Pan Dong <pandong.peter@bytedance.com> Link: https://lore.kernel.org/r/20210525073656.31594-1-pandong.peter@bytedance.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: correct the cache_nr in tracepoint ext4_es_shrink_exitZhang Yi2021-06-231-0/+1
| | | | | | | | | | | | | The cache_cnt parameter of tracepoint ext4_es_shrink_exit means the remaining cache count after shrink, but now it is the cache count before shrink, fix it by read sbi->s_extent_cache_cnt again. Fixes: 1ab6c4997e04 ("fs: convert fs shrinkers to new scan/count API") Cc: stable@vger.kernel.org # 3.12+ Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210522103045.690103-3-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: remove check for zero nr_to_scan in ext4_es_scan()Zhang Yi2021-06-231-3/+0
| | | | | | | | | | | | | After converting fs shrinkers to new scan/count API, we are no longer pass zero nr_to_scan parameter to detect the number of objects to free, just remove this check. Fixes: 1ab6c4997e04 ("fs: convert fs shrinkers to new scan/count API") Cc: stable@vger.kernel.org # 3.12+ Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210522103045.690103-2-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: remove set but rewrite variablesTian Tao2021-06-231-1/+1
| | | | | | | | | | | In the ext4_dx_add_entry function, the at variable is assigned but will reset just after “again:” label. So delete the unnecessary assignment. this will not chang the logic. Signed-off-by: Tian Tao <tiantao6@hisilicon.com> Reviewed-by: Artem Blagodarenko <artem.blagodarenko@gmail.com> Link: https://lore.kernel.org/r/1621493752-36890-1-git-send-email-tiantao6@hisilicon.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: add ioctl EXT4_IOC_CHECKPOINTLeah Rumancik2021-06-232-0/+64
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ioctl EXT4_IOC_CHECKPOINT checkpoints and flushes the journal. This includes forcing all the transactions to the log, checkpointing the transactions, and flushing the log to disk. This ioctl takes u32 "flags" as an argument. Three flags are supported. EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN can be used to verify input to the ioctl. It returns error if there is any invalid input, otherwise it returns success without performing any checkpointing. The other two flags, EXT4_IOC_CHECKPOINT_FLAG_DISCARD and EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT, can be used to issue requests to discard or zeroout the journal logs blocks, respectively. At this point, EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT is primarily added to enable testing of this codepath on devices that don't support discard. EXT4_IOC_CHECKPOINT_FLAG_DISCARD and EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT cannot both be set. Systems that wish to achieve content deletion SLO can set up a daemon that calls this ioctl at a regular interval such that it matches with the SLO requirement. Thus, with this patch, the ext4_dir_entry2 wipeout patch[1], and the Ext4 "-o discard" mount option set, Ext4 can now guarantee that all file contents, file metatdata, and filenames will not be accessible through the filesystem and will have had discard or zeroout requests issued for corresponding device blocks. The __jbd2_journal_erase function could also be used to discard or zero-fill the journal during journal load after recovery. This would provide a potential solution to a journal replay bug reported earlier this year[2]. After a successful journal recovery, e2fsck can call this ioctl to discard the journal as well. [1] https://lore.kernel.org/linux-ext4/YIHknqxngB1sUdie@mit.edu/ [2] https://lore.kernel.org/linux-ext4/YDZoaacIYStFQT8g@mit.edu/ Link: https://lore.kernel.org/r/20210518151327.130198-2-leah.rumancik@gmail.com Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: add discard/zeroout flags to journal flushLeah Rumancik2021-06-236-16/+129
| | | | | | | | | Add a flags argument to jbd2_journal_flush to enable discarding or zero-filling the journal blocks while flushing the journal. Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com> Link: https://lore.kernel.org/r/20210518151327.130198-1-leah.rumancik@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: return error code when ext4_fill_flex_info() failsYang Yingliang2021-06-171-0/+1
| | | | | | | | | | | | | | | After commit c89128a00838 ("ext4: handle errors on ext4_commit_super"), 'ret' may be set to 0 before calling ext4_fill_flex_info(), if ext4_fill_flex_info() fails ext4_mount() doesn't return error code, it makes 'root' is null which causes crash in legacy_get_tree(). Fixes: c89128a00838 ("ext4: handle errors on ext4_commit_super") Reported-by: Hulk Robot <hulkci@huawei.com> Cc: <stable@vger.kernel.org> # v4.18+ Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Link: https://lore.kernel.org/r/20210510111051.55650-1-yangyingliang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: cleanup in-core orphan list if ext4_truncate() failed to get a ↵Zhang Yi2021-06-171-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | | transaction handle In ext4_orphan_cleanup(), if ext4_truncate() failed to get a transaction handle, it didn't remove the inode from the in-core orphan list, which may probably trigger below error dump in ext4_destroy_inode() during the final iput() and could lead to memory corruption on the later orphan list changes. EXT4-fs (sda): Inode 6291467 (00000000b8247c67): orphan list check failed! 00000000b8247c67: 0001f30a 00000004 00000000 00000023 ............#... 00000000e24cde71: 00000006 014082a3 00000000 00000000 ......@......... 0000000072c6a5ee: 00000000 00000000 00000000 00000000 ................ ... This patch fix this by cleanup in-core orphan list manually if ext4_truncate() return error. Cc: stable@kernel.org Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210507071904.160808-1-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: fix kernel infoleak via ext4_extent_headerAnirudh Rayabharam2021-06-171-0/+3
| | | | | | | | | | | | | Initialize eh_generation of struct ext4_extent_header to prevent leaking info to userspace. Fixes KMSAN kernel-infoleak bug reported by syzbot at: http://syzkaller.appspot.com/bug?id=78e9ad0e6952a3ca16e8234724b2fa92d041b9b8 Cc: stable@kernel.org Reported-by: syzbot+2dcfeaf8cb49b05e8f1a@syzkaller.appspotmail.com Fixes: a86c61812637 ("[PATCH] ext3: add extent map support") Signed-off-by: Anirudh Rayabharam <mail@anirudhrb.com> Link: https://lore.kernel.org/r/20210506185655.7118-1-mail@anirudhrb.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: fix memory leak in ext4_fill_superPavel Skripkin2021-06-173-21/+21
| | | | | | | | | | | | | | | | | | | static int kthread(void *_create) will return -ENOMEM or -EINTR in case of internal failure or kthread_stop() call happens before threadfn call. To prevent fancy error checking and make code more straightforward we moved all cleanup code out of kmmpd threadfn. Also, dropped struct mmpd_data at all. Now struct super_block is a threadfn data and struct buffer_head embedded into struct ext4_sb_info. Reported-by: syzbot+d9e482e303930fa4f6ff@syzkaller.appspotmail.com Signed-off-by: Pavel Skripkin <paskripkin@gmail.com> Link: https://lore.kernel.org/r/20210430185046.15742-1-paskripkin@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: remove redundant assignment to errorJiapeng Chong2021-06-171-3/+2
| | | | | | | | | | | | | | | Variable error is set to zero but this value is never read as it's not used later on, hence it is a redundant assignment and can be removed. Cleans up the following clang-analyzer warning: fs/ext4/ioctl.c:657:3: warning: Value stored to 'error' is never read [clang-analyzer-deadcode.DeadStores]. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Link: https://lore.kernel.org/r/1619691409-83160-1-git-send-email-jiapeng.chong@linux.alibaba.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: remove redundant check buffer_uptodate()Joseph Qi2021-06-171-1/+1
| | | | | | | | | | Now set_buffer_uptodate() will test first and then set, so we don't have to check buffer_uptodate() first, remove it to simplify code. Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Link: https://lore.kernel.org/r/1619418587-5580-1-git-send-email-joseph.qi@linux.alibaba.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: fix overflow in ext4_iomap_alloc()Jan Kara2021-06-171-1/+1
| | | | | | | | | | | | | A code in iomap alloc may overflow block number when converting it to byte offset. Luckily this is mostly harmless as we will just use more expensive method of writing using unwritten extents even though we are writing beyond i_size. Cc: stable@kernel.org Fixes: 378f32bab371 ("ext4: introduce direct I/O write using iomap infrastructure") Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210412102333.2676-4-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds2021-06-068-126/+135
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Miscellaneous ext4 bug fixes" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: Only advertise encrypted_casefold when encryption and unicode are enabled ext4: fix no-key deletion for encrypt+casefold ext4: fix memory leak in ext4_fill_super ext4: fix fast commit alignment issues ext4: fix bug on in ext4_es_cache_extent as ext4_split_extent_at failed ext4: fix accessing uninit percpu counter variable with fast_commit ext4: fix memory leak in ext4_mb_init_backend on error path.
| * ext4: Only advertise encrypted_casefold when encryption and unicode are enabledDaniel Rosenberg2021-06-061-0/+4
| | | | | | | | | | | | | | | | | | | | | | Encrypted casefolding is only supported when both encryption and casefolding are both enabled in the config. Fixes: 471fbbea7ff7 ("ext4: handle casefolding with encryption") Cc: stable@vger.kernel.org # 5.13+ Signed-off-by: Daniel Rosenberg <drosen@google.com> Link: https://lore.kernel.org/r/20210603094849.314342-1-drosen@google.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: fix no-key deletion for encrypt+casefoldDaniel Rosenberg2021-06-061-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 471fbbea7ff7 ("ext4: handle casefolding with encryption") is missing a few checks for the encryption key which are needed to support deleting enrypted casefolded files when the key is not present. This bug made it impossible to delete encrypted+casefolded directories without the encryption key, due to errors like: W : EXT4-fs warning (device vdc): __ext4fs_dirhash:270: inode #49202: comm Binder:378_4: Siphash requires key Repro steps in kvm-xfstests test appliance: mkfs.ext4 -F -E encoding=utf8 -O encrypt /dev/vdc mount /vdc mkdir /vdc/dir chattr +F /vdc/dir keyid=$(head -c 64 /dev/zero | xfs_io -c add_enckey /vdc | awk '{print $NF}') xfs_io -c "set_encpolicy $keyid" /vdc/dir for i in `seq 1 100`; do mkdir /vdc/dir/$i done xfs_io -c "rm_enckey $keyid" /vdc rm -rf /vdc/dir # fails with the bug Fixes: 471fbbea7ff7 ("ext4: handle casefolding with encryption") Signed-off-by: Daniel Rosenberg <drosen@google.com> Link: https://lore.kernel.org/r/20210522004132.2142563-1-drosen@google.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: fix memory leak in ext4_fill_superAlexey Makhalov2021-06-061-2/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Buffer head references must be released before calling kill_bdev(); otherwise the buffer head (and its page referenced by b_data) will not be freed by kill_bdev, and subsequently that bh will be leaked. If blocksizes differ, sb_set_blocksize() will kill current buffers and page cache by using kill_bdev(). And then super block will be reread again but using correct blocksize this time. sb_set_blocksize() didn't fully free superblock page and buffer head, and being busy, they were not freed and instead leaked. This can easily be reproduced by calling an infinite loop of: systemctl start <ext4_on_lvm>.mount, and systemctl stop <ext4_on_lvm>.mount ... since systemd creates a cgroup for each slice which it mounts, and the bh leak get amplified by a dying memory cgroup that also never gets freed, and memory consumption is much more easily noticed. Fixes: ce40733ce93d ("ext4: Check for return value from sb_set_blocksize") Fixes: ac27a0ec112a ("ext4: initial copy of files from ext3") Link: https://lore.kernel.org/r/20210521075533.95732-1-amakhalov@vmware.com Signed-off-by: Alexey Makhalov <amakhalov@vmware.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
| * ext4: fix fast commit alignment issuesHarshad Shirwadkar2021-06-062-99/+90
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Fast commit recovery data on disk may not be aligned. So, when the recovery code reads it, this patch makes sure that fast commit info found on-disk is first memcpy-ed into an aligned variable before accessing it. As a consequence of it, we also remove some macros that could resulted in unaligned accesses. Cc: stable@kernel.org Fixes: 8016e29f4362 ("ext4: fast commit recovery path") Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20210519215920.2037527-1-harshads@google.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: fix bug on in ext4_es_cache_extent as ext4_split_extent_at failedYe Bin2021-06-061-20/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We got follow bug_on when run fsstress with injecting IO fault: [130747.323114] kernel BUG at fs/ext4/extents_status.c:762! [130747.323117] Internal error: Oops - BUG: 0 [#1] SMP ...... [130747.334329] Call trace: [130747.334553] ext4_es_cache_extent+0x150/0x168 [ext4] [130747.334975] ext4_cache_extents+0x64/0xe8 [ext4] [130747.335368] ext4_find_extent+0x300/0x330 [ext4] [130747.335759] ext4_ext_map_blocks+0x74/0x1178 [ext4] [130747.336179] ext4_map_blocks+0x2f4/0x5f0 [ext4] [130747.336567] ext4_mpage_readpages+0x4a8/0x7a8 [ext4] [130747.336995] ext4_readpage+0x54/0x100 [ext4] [130747.337359] generic_file_buffered_read+0x410/0xae8 [130747.337767] generic_file_read_iter+0x114/0x190 [130747.338152] ext4_file_read_iter+0x5c/0x140 [ext4] [130747.338556] __vfs_read+0x11c/0x188 [130747.338851] vfs_read+0x94/0x150 [130747.339110] ksys_read+0x74/0xf0 This patch's modification is according to Jan Kara's suggestion in: https://patchwork.ozlabs.org/project/linux-ext4/patch/20210428085158.3728201-1-yebin10@huawei.com/ "I see. Now I understand your patch. Honestly, seeing how fragile is trying to fix extent tree after split has failed in the middle, I would probably go even further and make sure we fix the tree properly in case of ENOSPC and EDQUOT (those are easily user triggerable). Anything else indicates a HW problem or fs corruption so I'd rather leave the extent tree as is and don't try to fix it (which also means we will not create overlapping extents)." Cc: stable@kernel.org Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210506141042.3298679-1-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: fix accessing uninit percpu counter variable with fast_commitRitesh Harjani2021-06-031-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When running generic/527 with fast_commit configuration, the following issue is seen on Power. With fast_commit, during ext4_fc_replay() (which can be called from ext4_fill_super()), if inode eviction happens then it can access an uninitialized percpu counter variable. This patch adds the check before accessing the counters in ext4_free_inode() path. [ 321.165371] run fstests generic/527 at 2021-04-29 08:38:43 [ 323.027786] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: block_validity. Quota mode: none. [ 323.618772] BUG: Unable to handle kernel data access on read at 0x1fbd80000 [ 323.619767] Faulting instruction address: 0xc000000000bae78c cpu 0x1: Vector: 300 (Data Access) at [c000000010706ef0] pc: c000000000bae78c: percpu_counter_add_batch+0x3c/0x100 lr: c0000000006d0bb0: ext4_free_inode+0x780/0xb90 pid = 5593, comm = mount ext4_free_inode+0x780/0xb90 ext4_evict_inode+0xa8c/0xc60 evict+0xfc/0x1e0 ext4_fc_replay+0xc50/0x20f0 do_one_pass+0xfe0/0x1350 jbd2_journal_recover+0x184/0x2e0 jbd2_journal_load+0x1c0/0x4a0 ext4_fill_super+0x2458/0x4200 mount_bdev+0x1dc/0x290 ext4_mount+0x28/0x40 legacy_get_tree+0x4c/0xa0 vfs_get_tree+0x4c/0x120 path_mount+0xcf8/0xd70 do_mount+0x80/0xd0 sys_mount+0x3fc/0x490 system_call_exception+0x384/0x3d0 system_call_common+0xec/0x278 Cc: stable@kernel.org Fixes: 8016e29f4362 ("ext4: fast commit recovery path") Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/6cceb9a75c54bef8fa9696c1b08c8df5ff6169e2.1619692410.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: fix memory leak in ext4_mb_init_backend on error path.Phillip Potter2021-05-211-1/+1
| | | | | | | | | | | | | | | | | | | | | | Fix a memory leak discovered by syzbot when a file system is corrupted with an illegally large s_log_groups_per_flex. Reported-by: syzbot+aa12d6106ea4ca1b6aae@syzkaller.appspotmail.com Signed-off-by: Phillip Potter <phil@philpotter.co.uk> Cc: stable@kernel.org Link: https://lore.kernel.org/r/20210412073837.1686-1-phil@philpotter.co.uk Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | ocfs2: fix data corruption by fallocateJunxiao Bi2021-06-051-5/+50
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When fallocate punches holes out of inode size, if original isize is in the middle of last cluster, then the part from isize to the end of the cluster will be zeroed with buffer write, at that time isize is not yet updated to match the new size, if writeback is kicked in, it will invoke ocfs2_writepage()->block_write_full_page() where the pages out of inode size will be dropped. That will cause file corruption. Fix this by zero out eof blocks when extending the inode size. Running the following command with qemu-image 4.2.1 can get a corrupted coverted image file easily. qemu-img convert -p -t none -T none -f qcow2 $qcow_image \ -O qcow2 -o compat=1.1 $qcow_image.conv The usage of fallocate in qemu is like this, it first punches holes out of inode size, then extend the inode size. fallocate(11, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 2276196352, 65536) = 0 fallocate(11, 0, 2276196352, 65536) = 0 v1: https://www.spinics.net/lists/linux-fsdevel/msg193999.html v2: https://lore.kernel.org/linux-fsdevel/20210525093034.GB4112@quack2.suse.cz/T/ Link: https://lkml.kernel.org/r/20210528210648.9124-1-junxiao.bi@oracle.com Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Jan Kara <jack@suse.cz> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Merge tag 'io_uring-5.13-2021-06-03' of git://git.kernel.dk/linux-blockLinus Torvalds2021-06-031-0/+1
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | Pull io_uring fix from Jens Axboe: "Just a single one-liner fix for an accounting regression in this release" * tag 'io_uring-5.13-2021-06-03' of git://git.kernel.dk/linux-block: io_uring: fix misaccounting fix buf pinned pages
| * | io_uring: fix misaccounting fix buf pinned pagesPavel Begunkov2021-05-301-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As Andres reports "... io_sqe_buffer_register() doesn't initialize imu. io_buffer_account_pin() does imu->acct_pages++, before calling io_account_mem(ctx, imu->acct_pages).", leading to evevntual -ENOMEM. Initialise the field. Reported-by: Andres Freund <andres@anarazel.de> Fixes: 41edf1a5ec967 ("io_uring: keep table of pointers to ubufs") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/438a6f46739ae5e05d9c75a0c8fa235320ff367c.1622285901.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | Merge tag 'for-5.13-rc4-tag' of ↵Linus Torvalds2021-06-036-58/+147
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Error handling improvements, caught by error injection: - handle errors during checksum deletion - set error on mapping when ordered extent io cannot be finished - inode link count fixup in tree-log - missing return value checks for inode updates in tree-log - abort transaction in rename exchange if adding second reference fails Fixes: - fix fsync failure after writes to prealloc extents - fix deadlock when cloning inline extents and low on available space - fix compressed writes that cross stripe boundary" * tag 'for-5.13-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: MAINTAINERS: add btrfs IRC link btrfs: fix deadlock when cloning inline extents and low on available space btrfs: fix fsync failure and transaction abort after writes to prealloc extents btrfs: abort in rename_exchange if we fail to insert the second ref btrfs: check error value from btrfs_update_inode in tree log btrfs: fixup error handling in fixup_inode_link_counts btrfs: mark ordered extent and inode with error if we fail to finish btrfs: return errors from btrfs_del_csums in cleanup_ref_head btrfs: fix error handling in btrfs_del_csums btrfs: fix compressed writes that cross stripe boundary
| * | | btrfs: fix deadlock when cloning inline extents and low on available spaceFilipe Manana2021-05-271-16/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are a few cases where cloning an inline extent requires copying data into a page of the destination inode. For these cases we are allocating the required data and metadata space while holding a leaf locked. This can result in a deadlock when we are low on available space because allocating the space may flush delalloc and two deadlock scenarios can happen: 1) When starting writeback for an inode with a very small dirty range that fits in an inline extent, we deadlock during the writeback when trying to insert the inline extent, at cow_file_range_inline(), if the extent is going to be located in the leaf for which we are already holding a read lock; 2) After successfully starting writeback, for non-inline extent cases, the async reclaim thread will hang waiting for an ordered extent to complete if the ordered extent completion needs to modify the leaf for which the clone task is holding a read lock (for adding or replacing file extent items). So the cloning task will wait forever on the async reclaim thread to make progress, which in turn is waiting for the ordered extent completion which in turn is waiting to acquire a write lock on the same leaf. So fix this by making sure we release the path (and therefore the leaf) every time we need to copy the inline extent's data into a page of the destination inode, as by that time we do not need to have the leaf locked. Fixes: 05a5a7621ce66c ("Btrfs: implement full reflink support for inline extents") CC: stable@vger.kernel.org # 5.10+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>