summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* Enable ext4 support for per-file/directory dax operationsTheodore Ts'o2020-06-1118-60/+350
|\ | | | | | | | | | | This adds the same per-file/per-directory DAX support for ext4 as was done for xfs, now that we finally have consensus over what the interface should be.
| * Documentation/dax: Update DAX enablement for ext4Ira Weiny2020-05-291-3/+3
| | | | | | | | | | | | | | | | | | | | Update the document to reflect ext4 and xfs now behave the same. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Link: https://lore.kernel.org/r/20200528150003.828793-10-ira.weiny@intel.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * fs/ext4: Introduce DAX inode flagIra Weiny2020-05-296-7/+66
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a flag ([EXT4|FS]_DAX_FL) to preserve FS_XFLAG_DAX in the ext4 inode. Set the flag to be user visible and changeable. Set the flag to be inherited. Allow applications to change the flag at any time except if it conflicts with the set of mutually exclusive flags (Currently VERITY, ENCRYPT, JOURNAL_DATA). Furthermore, restrict setting any of the exclusive flags if DAX is set. While conceptually possible, we do not allow setting EXT4_DAX_FL while at the same time clearing exclusion flags (or vice versa) for 2 reasons: 1) The DAX flag does not take effect immediately which introduces quite a bit of complexity 2) There is no clear use case for being this flexible Finally, on regular files, flag the inode to not be cached to facilitate changing S_DAX on the next creation of the inode. Signed-off-by: Ira Weiny <ira.weiny@intel.com> Link: https://lore.kernel.org/r/20200528150003.828793-9-ira.weiny@intel.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * fs/ext4: Remove jflag variableIra Weiny2020-05-291-7/+4
| | | | | | | | | | | | | | | | | | The jflag variable serves almost no purpose. Remove it. Signed-off-by: Ira Weiny <ira.weiny@intel.com> Link: https://lore.kernel.org/r/20200528150003.828793-8-ira.weiny@intel.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * fs/ext4: Make DAX mount option a tri-stateIra Weiny2020-05-293-10/+61
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We add 'always', 'never', and 'inode' (default). '-o dax' continues to operate the same which is equivalent to 'always'. This new functionality is limited to ext4 only. Specifically we introduce a 2nd DAX mount flag EXT4_MOUNT2_DAX_NEVER and set it and EXT4_MOUNT_DAX_ALWAYS appropriately for the mode. We also force EXT4_MOUNT2_DAX_NEVER if !CONFIG_FS_DAX. Finally, EXT4_MOUNT2_DAX_INODE is used solely to detect if the user specified that option for printing. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Link: https://lore.kernel.org/r/20200528150003.828793-7-ira.weiny@intel.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * fs/ext4: Only change S_DAX on inode loadIra Weiny2020-05-296-9/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To prevent complications with in memory inodes we only set S_DAX on inode load. FS_XFLAG_DAX can be changed at any time and S_DAX will change after inode eviction and reload. Add init bool to ext4_set_inode_flags() to indicate if the inode is being newly initialized. Assert that S_DAX is not set on an inode which is just being loaded. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Link: https://lore.kernel.org/r/20200528150003.828793-6-ira.weiny@intel.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * fs/ext4: Update ext4_should_use_dax()Ira Weiny2020-05-293-6/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | S_DAX should only be enabled when the underlying block device supports dax. Cache the underlying support for DAX in the super block and modify ext4_should_use_dax() to check for device support prior to the over riding mount option. While we are at it change the function to ext4_should_enable_dax() as this better reflects the ask as well as matches xfs. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Link: https://lore.kernel.org/r/20200528150003.828793-5-ira.weiny@intel.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * fs/ext4: Change EXT4_MOUNT_DAX to EXT4_MOUNT_DAX_ALWAYSIra Weiny2020-05-293-9/+9
| | | | | | | | | | | | | | | | | | | | | | In prep for the new tri-state mount option which then introduces EXT4_MOUNT_DAX_NEVER. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Link: https://lore.kernel.org/r/20200528150003.828793-4-ira.weiny@intel.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * fs/ext4: Disallow verity if inode is DAXIra Weiny2020-05-292-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Verity and DAX are incompatible. Changing the DAX mode due to a verity flag change is wrong without a corresponding address_space_operations update. Make the 2 options mutually exclusive by returning an error if DAX was set first. (Setting DAX is already disabled if Verity is set first.) Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Link: https://lore.kernel.org/r/20200528150003.828793-3-ira.weiny@intel.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * fs/ext4: Narrow scope of DAX check in setflagsIra Weiny2020-05-291-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | When preventing DAX and journaling on an inode. Use the effective DAX check rather than the mount option. This will be required to support per inode DAX flags. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Link: https://lore.kernel.org/r/20200528150003.828793-2-ira.weiny@intel.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * fs: Introduce DCACHE_DONTCACHEIra Weiny2020-05-134-1/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | DCACHE_DONTCACHE indicates a dentry should not be cached on final dput(). Also add a helper function to mark DCACHE_DONTCACHE on all dentries pointing to a specific inode when that inode is being set I_DONTCACHE. This facilitates dropping dentry references to inodes sooner which require eviction to swap S_DAX mode. Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * fs: Lift XFS_IDONTCACHE to the VFS layerIra Weiny2020-05-134-6/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | DAX effective mode (S_DAX) changes requires inode eviction. XFS has an advisory flag (XFS_IDONTCACHE) to prevent caching of the inode if no other additional references are taken. We lift this flag to the VFS layer and change the behavior slightly by allowing the flag to remain even if multiple references are taken. This will expedite the eviction of inodes to change S_DAX. Cc: Al Viro <viro@zeniv.linux.org.uk> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * Documentation/dax: Update Usage sectionIra Weiny2020-05-041-3/+139
| | | | | | | | | | | | | | | | | | | | Update the Usage section to reflect the new individual dax selection functionality. Signed-off-by: Ira Weiny <ira.weiny@intel.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * fs/stat: Define DAX statx attributeIra Weiny2020-05-042-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In order for users to determine if a file is currently operating in DAX state (effective DAX). Define a statx attribute value and set that attribute if the effective DAX flag is set. To go along with this we propose the following addition to the statx man page: STATX_ATTR_DAX The file is in the DAX (cpu direct access) state. DAX state attempts to minimize software cache effects for both I/O and memory mappings of this file. It requires a file system which has been configured to support DAX. DAX generally assumes all accesses are via cpu load / store instructions which can minimize overhead for small accesses, but may adversely affect cpu utilization for large transfers. File I/O is done directly to/from user-space buffers and memory mapped I/O may be performed with direct memory mappings that bypass kernel page cache. While the DAX property tends to result in data being transferred synchronously, it does not give the same guarantees of O_SYNC where data and the necessary metadata are transferred together. A DAX file may support being mapped with the MAP_SYNC flag, which enables a program to use CPU cache flush instructions to persist CPU store operations without an explicit fsync(2). See mmap(2) for more information. Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * fs: Remove unneeded IS_DAX() check in io_is_direct()Ira Weiny2020-05-042-9/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove the check because DAX now has it's own read/write methods and file systems which support DAX check IS_DAX() prior to IOCB_DIRECT on their own. Therefore, it does not matter if the file state is DAX when the iocb flags are created. Also remove io_is_direct() as it is just a simple flag check. Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* | ext4: avoid unnecessary transaction starts during writebackJan Kara2020-06-041-18/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ext4_writepages() currently works in a loop like: start a transaction scan inode for pages to write map and submit these pages stop the transaction This loop results in starting transaction once more than is needed because in the last iteration we start a transaction only to scan the inode and find there are no pages to write. This can be significant increase in number of transaction starts for single-extent files or files that have all blocks already mapped. Furthermore we already know from previous iteration whether there are more pages to write or not. So propagate the information from mpage_prepare_extent_to_map() and avoid unnecessary looping in case there are no more pages to write. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200525081215.29451-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | ext4: don't block for O_DIRECT if IOCB_NOWAIT is setJens Axboe2020-06-041-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Running with some debug patches to detect illegal blocking triggered the extend/unaligned condition in ext4. If ext4 needs to extend the file (and hence go to buffered IO), or if the app is doing unaligned IO, then ext4 asks the iomap code to wait for IO completion. If the caller asked for no-wait semantics by setting IOCB_NOWAIT, then ext4 should return -EAGAIN instead. Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/76152096-2bbb-7682-8fce-4cb498bcd909@kernel.dk Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | ext4: remove the access_ok() check in ext4_ioctl_get_es_cacheChristoph Hellwig2020-06-041-5/+0
| | | | | | | | | | | | | | | | | | | | | | access_ok just checks we are fed a proper user pointer. We also do that in copy_to_user itself, so no need to do this early. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200523073016.2944131-10-hch@lst.de Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | fs: remove the access_ok() check in ioctl_fiemapChristoph Hellwig2020-06-041-5/+1
| | | | | | | | | | | | | | | | | | | | | | access_ok just checks we are fed a proper user pointer. We also do that in copy_to_user itself, so no need to do this early. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Link: https://lore.kernel.org/r/20200523073016.2944131-9-hch@lst.de Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | fs: handle FIEMAP_FLAG_SYNC in fiemap_prepChristoph Hellwig2020-06-0410-31/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | By moving FIEMAP_FLAG_SYNC handling to fiemap_prep we ensure it is handled once instead of duplicated, but can still be done under fs locks, like xfs/iomap intended with its duplicate handling. Also make sure the error value of filemap_write_and_wait is propagated to user space. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Link: https://lore.kernel.org/r/20200523073016.2944131-8-hch@lst.de Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | fs: move fiemap range validation into the file systems instancesChristoph Hellwig2020-06-0410-54/+47
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Replace fiemap_check_flags with a fiemap_prep helper that also takes the inode and mapped range, and performs the sanity check and truncation previously done in fiemap_check_range. This way the validation is inside the file system itself and thus properly works for the stacked overlayfs case as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Link: https://lore.kernel.org/r/20200523073016.2944131-7-hch@lst.de Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | iomap: fix the iomap_fiemap prototypeChristoph Hellwig2020-06-042-2/+2
| | | | | | | | | | | | | | | | | | | | | | iomap_fiemap should take u64 start and len arguments, just like the ->fiemap prototype. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Link: https://lore.kernel.org/r/20200523073016.2944131-6-hch@lst.de Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | fs: move the fiemap definitions out of fs.hChristoph Hellwig2020-06-0418-21/+43
| | | | | | | | | | | | | | | | | | | | | | No need to pull the fiemap definitions into almost every file in the kernel build. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Link: https://lore.kernel.org/r/20200523073016.2944131-5-hch@lst.de Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | fs: mark __generic_block_fiemap staticChristoph Hellwig2020-06-042-7/+1
| | | | | | | | | | | | | | | | | | | | | | There is no caller left outside of ioctl.c. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Link: https://lore.kernel.org/r/20200523073016.2944131-4-hch@lst.de Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | ext4: remove the call to fiemap_check_flags in ext4_fiemapChristoph Hellwig2020-06-041-3/+0
| | | | | | | | | | | | | | | | | | | | | | iomap_fiemap already calls fiemap_check_flags first thing, so this additional check is redundant. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200523073016.2944131-3-hch@lst.de Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | ext4: split _ext4_fiemapChristoph Hellwig2020-06-041-37/+35
| | | | | | | | | | | | | | | | | | | | | | The fiemap and EXT4_IOC_GET_ES_CACHE cases share almost no code, so split them into entirely separate functions. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200523073016.2944131-2-hch@lst.de Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | ext4: fix fiemap size checks for bitmap filesChristoph Hellwig2020-06-042-31/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | Add an extra validation of the len parameter, as for ext4 some files might have smaller file size limits than others. This also means the redundant size check in ext4_ioctl_get_es_cache can go away, as all size checking is done in the shared fiemap handler. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200505154324.3226743-3-hch@lst.de Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | ext4: fix EXT4_MAX_LOGICAL_BLOCK macroRitesh Harjani2020-06-041-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ext4 supports max number of logical blocks in a file to be 0xffffffff. (This is since ext4_extent's ee_block is __le32). This means that EXT4_MAX_LOGICAL_BLOCK should be 0xfffffffe (starting from 0 logical offset). This patch fixes this. The issue was seen when ext4 moved to iomap_fiemap API and when overlayfs was mounted on top of ext4. Since overlayfs was missing filemap_check_ranges(), so it could pass a arbitrary huge length which lead to overflow of map.m_len logic. This patch fixes that. Fixes: d3b6f23f7167 ("ext4: move ext4_fiemap to use iomap framework") Reported-by: syzbot+77fa5bdb65cc39711820@syzkaller.appspotmail.com Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20200505154324.3226743-2-hch@lst.de Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | add comment for ext4_dir_entry_2 file_type memberJonathan Grant2020-06-041-1/+1
| | | | | | | | | | | | | | Signed-off-by: Jonathan Grant <jg@jguk.org> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/ad3290d5-86af-99c1-f9d5-cd1bab710429@jguk.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | jbd2: avoid leaking transaction credits when unreserving handleJan Kara2020-06-041-3/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When reserved transaction handle is unused, we subtract its reserved credits in __jbd2_journal_unreserve_handle() called from jbd2_journal_stop(). However this function forgets to remove reserved credits from transaction->t_outstanding_credits and thus the transaction space that was reserved remains effectively leaked. The leaked transaction space can be quite significant in some cases and leads to unnecessarily small transactions and thus reducing throughput of the journalling machinery. E.g. fsmark workload creating lots of 4k files was observed to have about 20% lower throughput due to this when ext4 is mounted with dioread_nolock mount option. Subtract reserved credits from t_outstanding_credits as well. CC: stable@vger.kernel.org Fixes: 8f7d89f36829 ("jbd2: transaction reservation support") Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200520133119.1383-3-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | ext4: drop ext4_journal_free_reserved()Jan Kara2020-06-041-6/+0
| | | | | | | | | | | | | | | | | | Remove ext4_journal_free_reserved() function. It is never used. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/20200520133119.1383-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | ext4: mballoc: use lock for checking free blocks while retryingRitesh Harjani2020-06-043-2/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently while doing block allocation grp->bb_free may be getting modified if discard is happening in parallel. For e.g. consider a case where there are lot of threads who have preallocated lot of blocks and there is a thread which is trying to discard all of this group's PA. Now it could happen that we see all of those group's bb_free is zero and fail the allocation while there is sufficient space if we free up all the PA. So this patch adds another flag "EXT4_MB_STRICT_CHECK" which will be set if we are unable to allocate any blocks in the first try (since we may not have considered blocks about to be discarded from PA lists). So during retry attempt to allocate blocks we will use ext4_lock_group() for checking if the group is good or not. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/9cb740a117c958c36596f167b12af1beae9a68b7.1589955723.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | ext4: mballoc: refactor ext4_mb_good_group()Ritesh Harjani2020-06-041-28/+50
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ext4_mb_good_group() definition was changed some time back and now it even initializes the buddy cache (via ext4_mb_init_group()), if in case the EXT4_MB_GRP_NEED_INIT() is true for a group. Note that ext4_mb_init_group() could sleep and so should not be called under a spinlock held. This is fine as of now because ext4_mb_good_group() is called before loading the buddy bitmap without ext4_lock_group() held and again called after loading the bitmap, only this time with ext4_lock_group() held. But still this whole thing is confusing. So this patch refactors out ext4_mb_good_group_nolock() which should be called when without holding ext4_lock_group(). Also in further patches we hold the spinlock (ext4_lock_group()) while doing any calculations which involves grp->bb_free or grp->bb_fragments. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/d9f7d031a5fbe1c943fae6bf1ff5cdf0604ae722.1589955723.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handlingRitesh Harjani2020-06-041-5/+51
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There could be a race in function ext4_mb_discard_group_preallocations() where the 1st thread may iterate through group's bb_prealloc_list and remove all the PAs and add to function's local list head. Now if the 2nd thread comes in to discard the group preallocations, it will see that the group->bb_prealloc_list is empty and will return 0. Consider for a case where we have less number of groups (for e.g. just group 0), this may even return an -ENOSPC error from ext4_mb_new_blocks() (where we call for ext4_mb_discard_group_preallocations()). But that is wrong, since 2nd thread should have waited for 1st thread to release all the PAs and should have retried for allocation. Since 1st thread was anyway going to discard the PAs. The algorithm using this percpu seq counter goes below: 1. We sample the percpu discard_pa_seq counter before trying for block allocation in ext4_mb_new_blocks(). 2. We increment this percpu discard_pa_seq counter when we either allocate or free these blocks i.e. while marking those blocks as used/free in mb_mark_used()/mb_free_blocks(). 3. We also increment this percpu seq counter when we successfully identify that the bb_prealloc_list is not empty and hence proceed for discarding of those PAs inside ext4_mb_discard_group_preallocations(). Now to make sure that the regular fast path of block allocation is not affected, as a small optimization we only sample the percpu seq counter on that cpu. Only when the block allocation fails and when freed blocks found were 0, that is when we sample percpu seq counter for all cpus using below function ext4_get_discard_pa_seq_sum(). This happens after making sure that all the PAs on grp->bb_prealloc_list got freed or if it's empty. It can be well argued that why don't just check for grp->bb_free to see if there are any free blocks to be allocated. So here are the two concerns which were discussed:- 1. If for some reason the blocks available in the group are not appropriate for allocation logic (say for e.g. EXT4_MB_HINT_GOAL_ONLY, although this is not yet implemented), then the retry logic may result into infinte looping since grp->bb_free is non-zero. 2. Also before preallocation was clubbed with block allocation with the same ext4_lock_group() held, there were lot of races where grp->bb_free could not be reliably relied upon. Due to above, this patch considers discard_pa_seq logic to determine if we should retry for block allocation. Say if there are are n threads trying for block allocation and none of those could allocate or discard any of the blocks, then all of those n threads will fail the block allocation and return -ENOSPC error. (Since the seq counter for all of those will match as no block allocation/discard was done during that duration). Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/7f254686903b87c419d798742fd9a1be34f0657b.1589955723.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | ext4: mballoc: refactor ext4_mb_discard_preallocations()Ritesh Harjani2020-06-041-3/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | Implement ext4_mb_discard_preallocations_should_retry() which we will need in later patches to add more logic like check for sequence number match to see if we should retry for block allocation or not. There should be no functionality change in this patch. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/1cfae0098d2aa9afbeb59331401258182868c8f2.1589955723.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | ext4: mballoc: add blocks to PA list under same spinlock after allocating blocksRitesh Harjani2020-06-041-35/+62
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ext4_mb_discard_preallocations() only checks for grp->bb_prealloc_list of every group to discard the group's PA to free up the space if allocation request fails. Consider below race:- Process A Process B 1. allocate blocks 1. Fails block allocation from ext4_mb_regular_allocator() ext4_lock_group() allocated blocks more than ac_o_ex.fe_len ext4_unlock_group() 2. Scans the grp->bb_prealloc_list (under ext4_lock_group()) and find nothing and thus return -ENOSPC. 2. Add the additional blocks to PA list ext4_lock_group() add blocks to grp->bb_prealloc_list ext4_unlock_group() Above race could be avoided if we add those additional blocks to grp->bb_prealloc_list at the same time with block allocation when ext4_lock_group() was still held. With this discard-PA will know if there are actually any blocks which could be freed from the PA Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/a2217dd782585b42328981832e6d396abaaccb80.1589955723.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | ext4: add casefold flag to EXT4_INODE_* flagsEric Biggers2020-06-041-1/+3
| | | | | | | | | | | | | | | | | | | | | | No one currently needs EXT4_INODE_CASEFOLD, but add it to keep the EXT4_INODE_* definitions in sync with the EXT4_*_FL definitions. Also make it clearer that the casefold flag is only for directories. Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20200510215252.87833-1-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | ext4: rework map struct instantiation in ext4_ext_map_blocks()Eric Whitney2020-06-041-25/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The path performing block allocations in ext4_ext_map_blocks() contains code trimming the length of a new extent that is repeated later in the function. This code is both redundant and unnecessary as the exact length of the new extent has already been calculated. Rewrite the instantiation of the map struct in this case to use the available values, avoiding the overhead of unnecessary conversions and improving clarity. Add another map struct instantiation tailored specifically to the separate case for an existing written extent. Remove an old comment that no longer appears applicable to the current code. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Link: https://lore.kernel.org/r/20200510155805.18808-1-enwlinux@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
* | ext4: make ext_debug() implementation to use pr_debug()Ritesh Harjani2020-06-044-88/+87
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ext_debug() msgs could be helpful, provided those could be enabled without recompiling kernel and also if we could selectively enable only required prints for case by case debugging. So make ext_debug() implementation use pr_debug(). Also change ext_debug() to be defined with CONFIG_EXT4_DEBUG. So EXT_DEBUG macro now mostly remain for below 3 functions. ext4_ext_show_path/leaf/move() (whose print msgs use ext_debug() which again could be dynamically enabled using pr_debug()) This also changes the ext_debug() to take inode as a parameter to add inode no. in all of it's msgs. Prints additional info like process name / pid, superblock id etc. This also removes any explicit function names passed in ext_debug(). Since ext_debug() on it's own prints file, func and line no. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/d31dc189b0aeda9384fe7665e36da7cd8c61571f.1589086800.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | ext4: mballoc: make mb_debug() implementation to use pr_debug()Ritesh Harjani2020-06-043-68/+55
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | mb_debug() msg had only 1 control level for all type of msgs. And if we enable mballoc_debug then all of those msgs would be enabled. Instead of adding multiple debug levels for mb_debug() msgs, use pr_debug() with which we could have finer control to print msgs at all of different levels (i.e. at file, func, line no.). Also add process name/pid, superblk id, and other info in mb_debug() msg. This also kills the mballoc_debug module parameter, since it is not needed any more. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/f0c660cbde9e2edbe95c67942ca9ad80dd2231eb.1589086800.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | ext4: replace EXT_DEBUG with __maybe_unused in ↵Ritesh Harjani2020-06-041-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | ext4_ext_handle_unwritten_extents() Replace EXT_DEBUG with __maybe_unused from inside ext4_ext_handle_unwritten_extents() function. There should be no functionality change in this patch. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/ae335b94506cd9db9d2648c1f4dd25a80f9f3ce2.1589086800.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | ext4: improve ext_debug() msg in case of block allocation failureRitesh Harjani2020-06-042-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | ext4_map_blocks() has ext_debug msg early at the start of function. We also get ext_debug msg if we could allocate a block from ext4_ext_map_blocks(). But there is no ext_debug() msg in case of block allocation failure. So add one along with error code. Also add more info in ext_debug() msg like how many blocks were allocated v/s how many were requested in ext4_ext_map_blocks(). Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/1610ec2aa932396be00f9d552fe29da473ead176.1589086800.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | ext4: use BIT() macro for BH_** state bitsRitesh Harjani2020-06-042-6/+6
| | | | | | | | | | | | | | | | | | | | | | Simply use BIT() macro for all BH_** state bits instead of open coding it. There should be no functionality change in this patch. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/57667689f51a3f9dba2fcef7d3425187fa3ba69f.1589086800.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | ext4: balloc: use task_pid_nr() helperRitesh Harjani2020-06-041-2/+3
| | | | | | | | | | | | | | | | | | Use task_pid_nr() function instead of current->pid. There should be no functionality change in this patch. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/4b58403e15e9c8deb34a1b93deb3fc9cd153ab84.1589086800.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | ext4: mballoc: fix possible NULL ptr & remove BUG_ONs from DOUBLE_CHECKRitesh Harjani2020-06-041-2/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make sure to check for e4b->bd_info->bb_bitmap == NULL, in mb_cmp_bitmaps() and return if NULL, to avoid possible NULL ptr dereference. Similar to how we do this in other ifdef DOUBLE_CHECK functions. Also remove the BUG_ON() logic if kmalloc() or ext4_read_block_bitmap() fails. We should simply mark grp->bb_bitmap as NULL if above happens. In fact ext4_read_block_bitmap() may even return an error in case of resize ioctl. Hence remove this BUG_ON logic (fstests ext4/032 may trigger this). Link: https://lore.kernel.org/r/9a54f8a696ff17c057cd571be3d15ac3ec1407f1.1589086800.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | ext4: mballoc: refactor code inside DOUBLE_CHECK into separate functionRitesh Harjani2020-06-041-17/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | This patch implemets mb_group_bb_bitmap_alloc() and mb_group_bb_bitmap_free() function to remove #ifdef DOUBLE_CHECK macro and it's related code from inside ext4_mb_add_groupinfo()/ext4_mb_release(). There should be no functionality change in this patch. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/8c2095d74b779f0254a19b24982490dc6f07c4f9.1589086800.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | ext4: mballoc: make ext4_mb_use_preallocated() return type as boolRitesh Harjani2020-06-041-7/+7
| | | | | | | | | | | | | | | | | | | | | | Change return type of function ext4_mb_use_preallocated() to bool to better reflect what this function can return. There should be no functionality change in this patch. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/7880cb6ef911465beafefcd7e9c3ea214688744b.1589086800.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | ext4: mballoc: simplify error handling in ext4_init_mballoc()Ritesh Harjani2020-06-041-10/+13
| | | | | | | | | | | | | | | | | | | | | | This patch simplifies error handling logic in ext4_init_mballoc(), by adding all the cleanups at one place at the end of that function. There should be no functionality change in this patch. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/8621a7bc68f7107a9ac4292afeb784515333bd25.1589086800.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | ext4: mballoc: fix few other format specifier in mb_debug()Ritesh Harjani2020-06-041-6/+6
| | | | | | | | | | | | | | | | | | Fix few other format specifiers in mb_debug() msgs. As such no other functionality change in this patch. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/574fa7f833abf2dbf3b53a2fea3195e71f6cdbd8.1589086800.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | ext4: mballoc: correct the mb_debug() format specifier for pa_len varRitesh Harjani2020-06-041-5/+5
| | | | | | | | | | | | | | | | | | | | | | pa->pa_len is an integer. Fix all of the format specifier used in mb_debug() for pa_len to %d instead of %u. As such no functionality change in this patch. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/af4987f643c586f62bcc9961e43f0a67151d5551.1589086800.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>