Diffstat (limited to 'Documentation/filesystems')
-rw-r--r-- | Documentation/filesystems/api-summary.rst | 9
-rw-r--r-- | Documentation/filesystems/erofs.rst | 175
-rw-r--r-- | Documentation/filesystems/ext4/directory.rst | 27
-rw-r--r-- | Documentation/filesystems/f2fs.rst | 14
-rw-r--r-- | Documentation/filesystems/index.rst | 1
-rw-r--r-- | Documentation/filesystems/locking.rst | 13
-rw-r--r-- | Documentation/filesystems/netfs_library.rst | 526
-rw-r--r-- | Documentation/filesystems/overlayfs.rst | 26
-rw-r--r-- | Documentation/filesystems/proc.rst | 4
-rw-r--r-- | Documentation/filesystems/vfat.rst | 2
-rw-r--r-- | Documentation/filesystems/vfs.rst | 15
11 files changed, 719 insertions, 93 deletions
diff --git a/Documentation/filesystems/api-summary.rst b/Documentation/filesystems/api-summary.rst index a94f17d9b836..7e5c04c98619 100644 --- a/Documentation/filesystems/api-summary.rst +++ b/Documentation/filesystems/api-summary.rst @@ -101,6 +101,9 @@ Other Functions .. kernel-doc:: fs/xattr.c :export: +.. kernel-doc:: fs/namespace.c + :export: + The proc filesystem =================== @@ -122,6 +125,12 @@ Events based on file descriptors .. kernel-doc:: fs/eventfd.c :export: +eventpoll (epoll) interfaces +============================ + +.. kernel-doc:: fs/eventpoll.c + :internal: + The Filesystem for Exporting Kernel Objects =========================================== diff --git a/Documentation/filesystems/erofs.rst b/Documentation/filesystems/erofs.rst index bf145171c2bf..832839fcf4c3 100644 --- a/Documentation/filesystems/erofs.rst +++ b/Documentation/filesystems/erofs.rst @@ -50,8 +50,8 @@ Here is the main features of EROFS: - Support POSIX.1e ACLs by using xattrs; - - Support transparent file compression as an option: - LZ4 algorithm with 4 KB fixed-sized output compression for high performance. + - Support transparent data compression as an option: + LZ4 algorithm with the fixed-sized output compression for high performance. The following git tree provides the file system user-space tools under development (ex, formatting tool mkfs.erofs): @@ -113,31 +113,31 @@ may not. All metadatas can be now observed in two different spaces (views): :: - |-> aligned with 8B - |-> followed closely - + meta_blkaddr blocks |-> another slot - _____________________________________________________________________ - | ... | inode | xattrs | extents | data inline | ... | inode ... - |________|_______|(optional)|(optional)|__(optional)_|_____|__________ - |-> aligned with the inode slot size - . . - . . - . . - . . - . . - . . 
- .____________________________________________________|-> aligned with 4B - | xattr_ibody_header | shared xattrs | inline xattrs | - |____________________|_______________|_______________| - |-> 12 bytes <-|->x * 4 bytes<-| . - . . . - . . . - . . . - ._______________________________.______________________. - | id | id | id | id | ... | id | ent | ... | ent| ... | - |____|____|____|____|______|____|_____|_____|____|_____| - |-> aligned with 4B - |-> aligned with 4B + |-> aligned with 8B + |-> followed closely + + meta_blkaddr blocks |-> another slot + _____________________________________________________________________ + | ... | inode | xattrs | extents | data inline | ... | inode ... + |________|_______|(optional)|(optional)|__(optional)_|_____|__________ + |-> aligned with the inode slot size + . . + . . + . . + . . + . . + . . + .____________________________________________________|-> aligned with 4B + | xattr_ibody_header | shared xattrs | inline xattrs | + |____________________|_______________|_______________| + |-> 12 bytes <-|->x * 4 bytes<-| . + . . . + . . . + . . . + ._______________________________.______________________. + | id | id | id | id | ... | id | ent | ... | ent| ... | + |____|____|____|____|______|____|_____|_____|____|_____| + |-> aligned with 4B + |-> aligned with 4B Inode could be 32 or 64 bytes, which can be distinguished from a common field which all inode versions have -- i_format:: @@ -175,13 +175,13 @@ may not. All metadatas can be now observed in two different spaces (views): Each share xattr can also be directly found by the following formula: xattr offset = xattr_blkaddr * block_size + 4 * xattr_id - :: +:: - |-> aligned by 4 bytes - + xattr_blkaddr blocks |-> aligned with 4 bytes - _________________________________________________________________________ - | ... | xattr_entry | xattr data | ... | xattr_entry | xattr data ... 
- |________|_____________|_____________|_____|______________|_______________ + |-> aligned by 4 bytes + + xattr_blkaddr blocks |-> aligned with 4 bytes + _________________________________________________________________________ + | ... | xattr_entry | xattr data | ... | xattr_entry | xattr data ... + |________|_____________|_____________|_____|______________|_______________ Directories ----------- @@ -193,48 +193,77 @@ algorithm (could refer to the related source code). :: - ___________________________ - / | - / ______________|________________ - / / | nameoff1 | nameoffN-1 - ____________.______________._______________v________________v__________ - | dirent | dirent | ... | dirent | filename | filename | ... | filename | - |___.0___|____1___|_____|___N-1__|____0_____|____1_____|_____|___N-1____| - \ ^ - \ | * could have - \ | trailing '\0' - \________________________| nameoff0 - - Directory block + ___________________________ + / | + / ______________|________________ + / / | nameoff1 | nameoffN-1 + ____________.______________._______________v________________v__________ + | dirent | dirent | ... | dirent | filename | filename | ... | filename | + |___.0___|____1___|_____|___N-1__|____0_____|____1_____|_____|___N-1____| + \ ^ + \ | * could have + \ | trailing '\0' + \________________________| nameoff0 + Directory block Note that apart from the offset of the first filename, nameoff0 also indicates the total number of directory entries in this block since it is no need to introduce another on-disk field at all. -Compression ------------ -Currently, EROFS supports 4KB fixed-sized output transparent file compression, -as illustrated below:: - - |---- Variant-Length Extent ----|-------- VLE --------|----- VLE ----- - clusterofs clusterofs clusterofs - | | | logical data - _________v_______________________________v_____________________v_______________ - ... | . | | . | | . | ... 
- ____|____.________|_____________|________.____|_____________|__.__________|____ - |-> cluster <-|-> cluster <-|-> cluster <-|-> cluster <-|-> cluster <-| - size size size size size - . . . . - . . . . - . . . . - _______._____________._____________._____________._____________________ - ... | | | | ... physical data - _______|_____________|_____________|_____________|_____________________ - |-> cluster <-|-> cluster <-|-> cluster <-| - size size size - -Currently each on-disk physical cluster can contain 4KB (un)compressed data -at most. For each logical cluster, there is a corresponding on-disk index to -describe its cluster type, physical cluster address, etc. - -See "struct z_erofs_vle_decompressed_index" in erofs_fs.h for more details. +Data compression +---------------- +EROFS implements LZ4 fixed-sized output compression which generates fixed-sized +compressed data blocks from variable-sized input in contrast to other existing +fixed-sized input solutions. Relatively higher compression ratios can be gotten +by using fixed-sized output compression since nowadays popular data compression +algorithms are mostly LZ77-based and such fixed-sized output approach can be +benefited from the historical dictionary (aka. sliding window). + +In details, original (uncompressed) data is turned into several variable-sized +extents and in the meanwhile, compressed into physical clusters (pclusters). +In order to record each variable-sized extent, logical clusters (lclusters) are +introduced as the basic unit of compress indexes to indicate whether a new +extent is generated within the range (HEAD) or not (NONHEAD). Lclusters are now +fixed in block size, as illustrated below:: + + |<- variable-sized extent ->|<- VLE ->| + clusterofs clusterofs clusterofs + | | | + _________v_________________________________v_______________________v________ + ... | . | | . | | . ... 
+ ____|____._________|______________|________.___ _|______________|__.________ + |-> lcluster <-|-> lcluster <-|-> lcluster <-|-> lcluster <-| + (HEAD) (NONHEAD) (HEAD) (NONHEAD) . + . CBLKCNT . . + . . . + . . . + _______._____________________________.______________._________________ + ... | | | | ... + _______|______________|______________|______________|_________________ + |-> big pcluster <-|-> pcluster <-| + +A physical cluster can be seen as a container of physical compressed blocks +which contains compressed data. Previously, only lcluster-sized (4KB) pclusters +were supported. After big pcluster feature is introduced (available since +Linux v5.13), pcluster can be a multiple of lcluster size. + +For each HEAD lcluster, clusterofs is recorded to indicate where a new extent +starts and blkaddr is used to seek the compressed data. For each NONHEAD +lcluster, delta0 and delta1 are available instead of blkaddr to indicate the +distance to its HEAD lcluster and the next HEAD lcluster. A PLAIN lcluster is +also a HEAD lcluster except that its data is uncompressed. See the comments +around "struct z_erofs_vle_decompressed_index" in erofs_fs.h for more details. + +If big pcluster is enabled, pcluster size in lclusters needs to be recorded as +well. Let the delta0 of the first NONHEAD lcluster store the compressed block +count with a special flag as a new called CBLKCNT NONHEAD lcluster. It's easy +to understand its delta0 is constantly 1, as illustrated below:: + + __________________________________________________________ + | HEAD | NONHEAD | NONHEAD | ... | NONHEAD | HEAD | HEAD | + |__:___|_(CBLKCNT)_|_________|_____|_________|__:___|____:_| + |<----- a big pcluster (with CBLKCNT) ------>|<-- -->| + a lcluster-sized pcluster (without CBLKCNT) ^ + +If another HEAD follows a HEAD lcluster, there is no room to record CBLKCNT, +but it's easy to know the size of such pcluster is 1 lcluster as well. 
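Editorial sketch (not part of the patch): the HEAD/NONHEAD/CBLKCNT rules in the erofs.rst hunk above can be illustrated with a small userspace model. The struct and function names below are hypothetical and do not mirror the on-disk ``z_erofs_vle_decompressed_index`` layout; only the rules quoted from the text are modelled (a NONHEAD lcluster finds its HEAD through delta0, a CBLKCNT lcluster always directly follows its HEAD, and a HEAD without a CBLKCNT neighbour owns a one-lcluster pcluster).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical in-memory view of one lcluster index (not the on-disk format). */
enum lcluster_type { LC_PLAIN, LC_HEAD, LC_NONHEAD };

struct lcluster_index {
    enum lcluster_type type;
    bool has_cblkcnt;     /* NONHEAD only: delta0 holds CBLKCNT instead */
    unsigned int delta0;  /* NONHEAD: distance back to HEAD, or CBLKCNT */
};

/* Return the index of the lcluster that starts the extent covering lcluster i. */
size_t find_head(const struct lcluster_index *idx, size_t i)
{
    /* HEAD and PLAIN lclusters start an extent themselves. */
    if (idx[i].type != LC_NONHEAD)
        return i;

    /* A CBLKCNT lcluster directly follows its HEAD, so its distance is 1. */
    if (idx[i].has_cblkcnt) {
        assert(i > 0);
        return i - 1;
    }

    assert(idx[i].delta0 <= i);
    return i - idx[i].delta0;
}

/* Size, in lclusters, of the pcluster that starts at HEAD lcluster `head`. */
unsigned int pcluster_lclusters(const struct lcluster_index *idx, size_t n,
                                size_t head)
{
    assert(head < n && idx[head].type != LC_NONHEAD);

    /* Big pcluster: the first NONHEAD after HEAD carries the block count. */
    if (head + 1 < n && idx[head + 1].type == LC_NONHEAD &&
        idx[head + 1].has_cblkcnt)
        return idx[head + 1].delta0;

    /* No room for CBLKCNT (or none recorded): lcluster-sized pcluster. */
    return 1;
}
```

With a layout such as HEAD, NONHEAD(CBLKCNT=2), NONHEAD, NONHEAD, HEAD, HEAD, the first HEAD would map to a two-block pcluster while each of the last two HEADs maps to a single-lcluster pcluster, matching the second diagram in the hunk.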
diff --git a/Documentation/filesystems/ext4/directory.rst b/Documentation/filesystems/ext4/directory.rst index 073940cc64ed..55f618b37144 100644 --- a/Documentation/filesystems/ext4/directory.rst +++ b/Documentation/filesystems/ext4/directory.rst @@ -121,6 +121,31 @@ The directory file type is one of the following values: * - 0x7 - Symbolic link. +To support directories that are both encrypted and casefolded, we +must also include hash information in the directory entry. We append +``ext4_extended_dir_entry_2`` to ``ext4_dir_entry_2`` except for the entries +for dot and dotdot, which are kept the same. The structure follows immediately +after ``name`` and is included in the size listed by ``rec_len``. If a directory +entry uses this extension, it may be up to 271 bytes. + +.. list-table:: + :widths: 8 8 24 40 + :header-rows: 1 + + * - Offset + - Size + - Name + - Description + * - 0x0 + - \_\_le32 + - hash + - The hash of the directory name + * - 0x4 + - \_\_le32 + - minor\_hash + - The minor hash of the directory name + + In order to add checksums to these classic directory blocks, a phony ``struct ext4_dir_entry`` is placed at the end of each leaf block to hold the checksum. The directory entry is 12 bytes long. The inode @@ -322,6 +347,8 @@ The directory hash is one of the following values: - Half MD4, unsigned. * - 0x5 - Tea, unsigned. + * - 0x6 + - Siphash. Interior nodes of an htree are recorded as ``struct dx_node``, which is also the full length of a data block: diff --git a/Documentation/filesystems/f2fs.rst b/Documentation/filesystems/f2fs.rst index 35ed01a5fbc9..992bf91eeec8 100644 --- a/Documentation/filesystems/f2fs.rst +++ b/Documentation/filesystems/f2fs.rst @@ -110,6 +110,12 @@ background_gc=%s Turn on/off cleaning operations, namely garbage on synchronous garbage collection running in background. Default value for this option is on. So garbage collection is on by default.
+gc_merge When background_gc is on, this option can be enabled to + let background GC thread to handle foreground GC requests, + it can eliminate the sluggish issue caused by slow foreground + GC operation when GC is triggered from a process with limited + I/O and CPU resources. +nogc_merge Disable GC merge feature. disable_roll_forward Disable the roll-forward recovery routine norecovery Disable the roll-forward recovery routine, mounted read- only (i.e., -o ro,disable_roll_forward) @@ -813,6 +819,14 @@ Compression implementation * chattr +c file * chattr +c dir; touch dir/file * mount w/ -o compress_extension=ext; touch file.ext + * mount w/ -o compress_extension=*; touch any_file + +- At this point, compression feature doesn't expose compressed space to user + directly in order to guarantee potential data updates later to the space. + Instead, the main goal is to reduce data writes to flash disk as much as + possible, resulting in extending disk life time as well as relaxing IO + congestion. Alternatively, we've added ioctl interface to reclaim compressed + space and show it to user after putting the immutable bit. Compress metadata layout:: diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 1f76b1cb3348..d4853cb919d2 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -53,6 +53,7 @@ filesystem implementations. 
journalling fscrypt fsverity + netfs_library Filesystems =========== diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst index b7dcc86c92a4..1e894480115b 100644 --- a/Documentation/filesystems/locking.rst +++ b/Documentation/filesystems/locking.rst @@ -80,13 +80,16 @@ prototypes:: struct file *, unsigned open_flag, umode_t create_mode); int (*tmpfile) (struct inode *, struct dentry *, umode_t); + int (*fileattr_set)(struct user_namespace *mnt_userns, + struct dentry *dentry, struct fileattr *fa); + int (*fileattr_get)(struct dentry *dentry, struct fileattr *fa); locking rules: all may block -============ ============================================= +============= ============================================= ops i_rwsem(inode) -============ ============================================= +============= ============================================= lookup: shared create: exclusive link: exclusive (both) @@ -107,7 +110,9 @@ fiemap: no update_time: no atomic_open: shared (exclusive if O_CREAT is set in open flags) tmpfile: no -============ ============================================= +fileattr_get: no or exclusive +fileattr_set: exclusive +============= ============================================= Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_rwsem @@ -469,7 +474,6 @@ prototypes:: int (*direct_access) (struct block_device *, sector_t, void **, unsigned long *); void (*unlock_native_capacity) (struct gendisk *); - int (*revalidate_disk) (struct gendisk *); int (*getgeo)(struct block_device *, struct hd_geometry *); void (*swap_slot_free_notify) (struct block_device *, unsigned long); @@ -484,7 +488,6 @@ ioctl: no compat_ioctl: no direct_access: no unlock_native_capacity: no -revalidate_disk: no getgeo: no swap_slot_free_notify: no (see below) ======================= =================== diff --git a/Documentation/filesystems/netfs_library.rst b/Documentation/filesystems/netfs_library.rst new file mode 100644 index 
000000000000..57a641847818 --- /dev/null +++ b/Documentation/filesystems/netfs_library.rst @@ -0,0 +1,526 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================================= +NETWORK FILESYSTEM HELPER LIBRARY +================================= + +.. Contents: + + - Overview. + - Buffered read helpers. + - Read helper functions. + - Read helper structures. + - Read helper operations. + - Read helper procedure. + - Read helper cache API. + + +Overview +======== + +The network filesystem helper library is a set of functions designed to aid a +network filesystem in implementing VM/VFS operations. For the moment, that +just includes turning various VM buffered read operations into requests to read +from the server. The helper library, however, can also interpose other +services, such as local caching or local data encryption. + +Note that the library module doesn't link against local caching directly, so +access must be provided by the netfs. + + +Buffered Read Helpers +===================== + +The library provides a set of read helpers that handle the ->readpage(), +->readahead() and much of the ->write_begin() VM operations and translate them +into a common call framework. + +The following services are provided: + + * Handles transparent huge pages (THPs). + + * Insulates the netfs from VM interface changes. + + * Allows the netfs to arbitrarily split reads up into pieces, even ones that + don't match page sizes or page alignments and that may cross pages. + + * Allows the netfs to expand a readahead request in both directions to meet + its needs. + + * Allows the netfs to partially fulfil a read, which will then be resubmitted. + + * Handles local caching, allowing cached data and server-read data to be + interleaved for a single request. + + * Handles clearing of bufferage that isn't on the server. + + * Handles retrying of reads that failed, switching reads from the cache to the + server as necessary.
+ + * In the future, this is a place where other services can be performed, such as + local encryption of data to be stored remotely or in the cache. + +From the network filesystem, the helpers require a table of operations. This +includes a mandatory method to issue a read operation along with a number of +optional methods. + + +Read Helper Functions +--------------------- + +Three read helpers are provided:: + + * void netfs_readahead(struct readahead_control *ractl, + const struct netfs_read_request_ops *ops, + void *netfs_priv); + * int netfs_readpage(struct file *file, + struct page *page, + const struct netfs_read_request_ops *ops, + void *netfs_priv); + * int netfs_write_begin(struct file *file, + struct address_space *mapping, + loff_t pos, + unsigned int len, + unsigned int flags, + struct page **_page, + void **_fsdata, + const struct netfs_read_request_ops *ops, + void *netfs_priv); + +Each corresponds to a VM operation, with the addition of a couple of parameters +for the use of the read helpers: + + * ``ops`` + + A table of operations through which the helpers can talk to the filesystem. + + * ``netfs_priv`` + + Filesystem private data (can be NULL). + +Both of these values will be stored into the read request structure. + +For ->readahead() and ->readpage(), the network filesystem should just jump +into the corresponding read helper; whereas for ->write_begin(), it may be a +little more complicated as the network filesystem might want to flush +conflicting writes or track dirty data and needs to put the acquired page if an +error occurs after calling the helper. + +The helpers manage the read request, calling back into the network filesystem +through the supplied table of operations. Waits will be performed as +necessary before returning for helpers that are meant to be synchronous. + +If an error occurs and netfs_priv is non-NULL, ops->cleanup() will be called to +deal with it.
If some parts of the request are in progress when an error +occurs, the request will get partially completed if sufficient data is read. + +Additionally, there is:: + + * void netfs_subreq_terminated(struct netfs_read_subrequest *subreq, + ssize_t transferred_or_error, + bool was_async); + +which should be called to complete a read subrequest. This is given the number +of bytes transferred or a negative error code, plus a flag indicating whether +the operation was asynchronous (ie. whether the follow-on processing can be +done in the current context, given this may involve sleeping). + + +Read Helper Structures +---------------------- + +The read helpers make use of a couple of structures to maintain the state of +the read. The first is a structure that manages a read request as a whole:: + + struct netfs_read_request { + struct inode *inode; + struct address_space *mapping; + struct netfs_cache_resources cache_resources; + void *netfs_priv; + loff_t start; + size_t len; + loff_t i_size; + const struct netfs_read_request_ops *netfs_ops; + unsigned int debug_id; + ... + }; + +The above fields are the ones the netfs can use. They are: + + * ``inode`` + * ``mapping`` + + The inode and the address space of the file being read from. The mapping + may or may not point to inode->i_data. + + * ``cache_resources`` + + Resources for the local cache to use, if present. + + * ``netfs_priv`` + + The network filesystem's private data. The value for this can be passed in + to the helper functions or set during the request. The ->cleanup() op will + be called if this is non-NULL at the end. + + * ``start`` + * ``len`` + + The file position of the start of the read request and the length. These + may be altered by the ->expand_readahead() op. + + * ``i_size`` + + The size of the file at the start of the request. + + * ``netfs_ops`` + + A pointer to the operation table. The value for this is passed into the + helper functions. 
+ + * ``debug_id`` + + A number allocated to this operation that can be displayed in trace lines + for reference. + + +The second structure is used to manage individual slices of the overall read +request:: + + struct netfs_read_subrequest { + struct netfs_read_request *rreq; + loff_t start; + size_t len; + size_t transferred; + unsigned long flags; + unsigned short debug_index; + ... + }; + +Each subrequest is expected to access a single source, though the helpers will +handle falling back from one source type to another. The members are: + + * ``rreq`` + + A pointer to the read request. + + * ``start`` + * ``len`` + + The file position of the start of this slice of the read request and the + length. + + * ``transferred`` + + The amount of data transferred so far of the length of this slice. The + network filesystem or cache should start the operation this far into the + slice. If a short read occurs, the helpers will call again, having updated + this to reflect the amount read so far. + + * ``flags`` + + Flags pertaining to the read. There are two of interest to the filesystem + or cache: + + * ``NETFS_SREQ_CLEAR_TAIL`` + + This can be set to indicate that the remainder of the slice, from + transferred to len, should be cleared. + + * ``NETFS_SREQ_SEEK_DATA_READ`` + + This is a hint to the cache that it might want to try skipping ahead to + the next data (ie. using SEEK_DATA). + + * ``debug_index`` + + A number allocated to this slice that can be displayed in trace lines for + reference. 
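Editorial sketch (not part of the patch): the interaction between ``len``, ``transferred`` and ``NETFS_SREQ_CLEAR_TAIL`` described above can be modelled in userspace C. Everything prefixed ``sketch_`` or ``SKETCH_`` is a hypothetical stand-in, the flag's bit value is arbitrary, and the rule that a short read with no progress fails is taken from the "Read Helper Procedure" text ("provided they have transferred some more data"). This is not the kernel implementation.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for NETFS_SREQ_CLEAR_TAIL; the bit position here is arbitrary. */
#define SKETCH_SREQ_CLEAR_TAIL (1UL << 0)

/* Hypothetical userspace mirror of the documented subrequest fields. */
struct sketch_subrequest {
    long long start;       /* file position of this slice */
    size_t len;            /* length of this slice */
    size_t transferred;    /* bytes read into the slice so far */
    unsigned long flags;
};

enum sketch_outcome { SKETCH_DONE, SKETCH_REISSUE, SKETCH_FAILED };

/*
 * Account for `just_read` more bytes landing in `buf` (which covers the whole
 * slice) and decide what happens next.
 */
enum sketch_outcome sketch_assess(struct sketch_subrequest *sreq,
                                  unsigned char *buf, size_t just_read)
{
    sreq->transferred += just_read;

    /* The whole slice has been read. */
    if (sreq->transferred >= sreq->len)
        return SKETCH_DONE;

    /* Short read with CLEAR_TAIL: zero from transferred to len, then finish. */
    if (sreq->flags & SKETCH_SREQ_CLEAR_TAIL) {
        memset(buf + sreq->transferred, 0, sreq->len - sreq->transferred);
        sreq->transferred = sreq->len;
        return SKETCH_DONE;
    }

    /* Short read that made progress: reissue from start + transferred. */
    if (just_read > 0)
        return SKETCH_REISSUE;

    /* No progress at all: give up on this slice. */
    return SKETCH_FAILED;
}
```

A reissued operation in this model would begin at ``start + transferred``, which is the "start the operation this far into the slice" behaviour the text asks of the filesystem or cache.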
+ + +Read Helper Operations +---------------------- + +The network filesystem must provide the read helpers with a table of operations +through which it can issue requests and negotiate:: + + struct netfs_read_request_ops { + void (*init_rreq)(struct netfs_read_request *rreq, struct file *file); + bool (*is_cache_enabled)(struct inode *inode); + int (*begin_cache_operation)(struct netfs_read_request *rreq); + void (*expand_readahead)(struct netfs_read_request *rreq); + bool (*clamp_length)(struct netfs_read_subrequest *subreq); + void (*issue_op)(struct netfs_read_subrequest *subreq); + bool (*is_still_valid)(struct netfs_read_request *rreq); + int (*check_write_begin)(struct file *file, loff_t pos, unsigned len, + struct page *page, void **_fsdata); + void (*done)(struct netfs_read_request *rreq); + void (*cleanup)(struct address_space *mapping, void *netfs_priv); + }; + +The operations are as follows: + + * ``init_rreq()`` + + [Optional] This is called to initialise the request structure. It is given + the file for reference and can modify the ->netfs_priv value. + + * ``is_cache_enabled()`` + + [Required] This is called by netfs_write_begin() to ask if the file is being + cached. It should return true if it is being cached and false otherwise. + + * ``begin_cache_operation()`` + + [Optional] This is called to ask the network filesystem to call into the + cache (if present) to initialise the caching state for this read. The netfs + library module cannot access the cache directly, so the cache should call + something like fscache_begin_read_operation() to do this. + + The cache gets to store its state in ->cache_resources and must set a table + of operations of its own there (though of a different type). + + This should return 0 on success and an error code otherwise. If an error is + reported, the operation may proceed anyway, just without local caching (only + out of memory and interruption errors cause failure here). 
+ + * ``expand_readahead()`` + + [Optional] This is called to allow the filesystem to expand the size of a + readahead read request. The filesystem gets to expand the request in both + directions, though it's not permitted to reduce it as the numbers may + represent an allocation already made. If local caching is enabled, it gets + to expand the request first. + + Expansion is communicated by changing ->start and ->len in the request + structure. Note that if any change is made, ->len must be increased by at + least as much as ->start is reduced. + + * ``clamp_length()`` + + [Optional] This is called to allow the filesystem to reduce the size of a + subrequest. The filesystem can use this, for example, to chop up a request + that has to be split across multiple servers or to put multiple reads in + flight. + + This should return 0 on success and an error code on error. + + * ``issue_op()`` + + [Required] The helpers use this to dispatch a subrequest to the server for + reading. In the subrequest, ->start, ->len and ->transferred indicate what + data should be read from the server. + + There is no return value; the netfs_subreq_terminated() function should be + called to indicate whether or not the operation succeeded and how much data + it transferred. The filesystem also should not deal with setting pages + uptodate, unlocking them or dropping their refs - the helpers need to deal + with this as they have to coordinate with copying to the local cache. + + Note that the helpers have the pages locked, but not pinned. It is possible + to use the ITER_XARRAY iov iterator to refer to the range of the inode that + is being operated upon without the need to allocate large bvec tables. + + * ``is_still_valid()`` + + [Optional] This is called to find out if the data just read from the local + cache is still valid. It should return true if it is still valid and false + if not. If it's not still valid, it will be reread from the server. 
+ + * ``check_write_begin()`` + + [Optional] This is called from the netfs_write_begin() helper once it has + allocated/grabbed the page to be modified to allow the filesystem to flush + conflicting state before allowing it to be modified. + + It should return 0 if everything is now fine, -EAGAIN if the page should be + regrabbed and any other error code to abort the operation. + + * ``done()`` + + [Optional] This is called after the pages in the request have all been + unlocked (and marked uptodate if applicable). + + * ``cleanup()`` + + [Optional] This is called as the request is being deallocated so that the + filesystem can clean up ->netfs_priv. + + + +Read Helper Procedure +--------------------- + +The read helpers work by the following general procedure: + + * Set up the request. + + * For readahead, allow the local cache and then the network filesystem to + propose expansions to the read request. This is then proposed to the VM. + If the VM cannot fully perform the expansion, a partially expanded read will + be performed, though this may not get written to the cache in its entirety. + + * Loop around slicing chunks off of the request to form subrequests: + + * If a local cache is present, it gets to do the slicing, otherwise the + helpers just try to generate maximal slices. + + * The network filesystem gets to clamp the size of each slice if it is to be + the source. This allows rsize and chunking to be implemented. + + * The helpers issue a read from the cache or a read from the server or just + clear the slice as appropriate. + + * The next slice begins at the end of the last one. + + * As slices finish being read, they terminate. + + * When all the subrequests have terminated, the subrequests are assessed and + any that are short or have failed are reissued: + + * Failed cache requests are issued against the server instead. + + * Failed server requests just fail.
+ + * Short reads against either source will be reissued against that source + provided they have transferred some more data: + + * The cache may need to skip holes that it can't do DIO from. + + * If NETFS_SREQ_CLEAR_TAIL was set, a short read will be cleared to the + end of the slice instead of reissuing. + + * Once the data is read, the pages that have been fully read/cleared: + + * Will be marked uptodate. + + * If a cache is present, will be marked with PG_fscache. + + * Unlocked + + * Any pages that need writing to the cache will then have DIO writes issued. + + * Synchronous operations will wait for reading to be complete. + + * Writes to the cache will proceed asynchronously and the pages will have the + PG_fscache mark removed when that completes. + + * The request structures will be cleaned up when everything has completed. + + +Read Helper Cache API +--------------------- + +When implementing a local cache to be used by the read helpers, two things are +required: some way for the network filesystem to initialise the caching for a +read request and a table of operations for the helpers to call. + +The network filesystem's ->begin_cache_operation() method is called to set up a +cache and this must call into the cache to do the work. If using fscache, for +example, the cache would call:: + + int fscache_begin_read_operation(struct netfs_read_request *rreq, + struct fscache_cookie *cookie); + +passing in the request pointer and the cookie corresponding to the file. + +The netfs_read_request object contains a place for the cache to hang its +state:: + + struct netfs_cache_resources { + const struct netfs_cache_ops *ops; + void *cache_priv; + void *cache_priv2; + }; + +This contains an operations table pointer and two private pointers. 
The +operation table looks like the following:: + + struct netfs_cache_ops { + void (*end_operation)(struct netfs_cache_resources *cres); + + void (*expand_readahead)(struct netfs_cache_resources *cres, + loff_t *_start, size_t *_len, loff_t i_size); + + enum netfs_read_source (*prepare_read)(struct netfs_read_subrequest *subreq, + loff_t i_size); + + int (*read)(struct netfs_cache_resources *cres, + loff_t start_pos, + struct iov_iter *iter, + bool seek_data, + netfs_io_terminated_t term_func, + void *term_func_priv); + + int (*write)(struct netfs_cache_resources *cres, + loff_t start_pos, + struct iov_iter *iter, + netfs_io_terminated_t term_func, + void *term_func_priv); + }; + +With a termination handler function pointer:: + + typedef void (*netfs_io_terminated_t)(void *priv, + ssize_t transferred_or_error, + bool was_async); + +The methods defined in the table are: + + * ``end_operation()`` + + [Required] Called to clean up the resources at the end of the read request. + + * ``expand_readahead()`` + + [Optional] Called at the beginning of a netfs_readahead() operation to allow + the cache to expand a request in either direction. This allows the cache to + size the request appropriately for the cache granularity. + + The function is passed pointers to the start and length in its parameters, + plus the size of the file for reference, and adjusts the start and length + appropriately.
+ + * ``prepare_read()`` + + [Required] Called to configure the next slice of a request. ->start and + ->len in the subrequest indicate where and how big the next slice can be; + the cache gets to reduce the length to match its granularity requirements. + + It should return one of: + + * ``NETFS_FILL_WITH_ZEROES`` + * ``NETFS_DOWNLOAD_FROM_SERVER`` + * ``NETFS_READ_FROM_CACHE`` + * ``NETFS_INVALID_READ`` + + to indicate whether the slice should just be cleared or whether it should be + downloaded from the server or read from the cache - or whether slicing + should be given up at the current point. + + * ``read()`` + + [Required] Called to read from the cache. The start file offset is given + along with an iterator to read to, which gives the length also. It can be + given a hint requesting that it seek forward from that start position for + data. + + Also provided is a pointer to a termination handler function and private + data to pass to that function. The termination function should be called + with the number of bytes transferred or an error code, plus a flag + indicating whether the termination is definitely happening in the caller's + context. + + * ``write()`` + + [Required] Called to write to the cache. The start file offset is given + along with an iterator to write from, which gives the length also. + + Also provided is a pointer to a termination handler function and private + data to pass to that function. The termination function should be called + with the number of bytes transferred or an error code, plus a flag + indicating whether the termination is definitely happening in the caller's + context. + +Note that these methods are passed a pointer to the cache resource structure, +not the read request structure as they could be used in other situations where +there isn't a read request structure as well, such as writing dirty data to the +cache. diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst index 78240e29b0bb..455ca86eb4fc 100644 --- a/Documentation/filesystems/overlayfs.rst +++ b/Documentation/filesystems/overlayfs.rst @@ -40,17 +40,17 @@ On 64bit systems, even if all overlay layers are not on the same underlying filesystem, the same compliant behavior could be achieved with the "xino" feature. The "xino" feature composes a unique object identifier from the real object st_ino and an underlying fsid index.
-
-If all underlying filesystems support NFS file handles and export file
-handles with 32bit inode number encoding (e.g. ext4), overlay filesystem
-will use the high inode number bits for fsid.  Even when the underlying
-filesystem uses 64bit inode numbers, users can still enable the "xino"
-feature with the "-o xino=on" overlay mount option.  That is useful for the
-case of underlying filesystems like xfs and tmpfs, which use 64bit inode
-numbers, but are very unlikely to use the high inode number bits.  In case
+The "xino" feature uses the high inode number bits for fsid, because the
+underlying filesystems rarely use the high inode number bits.  In case
 the underlying inode number does overflow into the high xino bits, overlay
 filesystem will fall back to the non xino behavior for that inode.
 
+The "xino" feature can be enabled with the "-o xino=on" overlay mount option.
+If all underlying filesystems support NFS file handles, the value of st_ino
+for overlay filesystem objects is not only unique, but also persistent over
+the lifetime of the filesystem.  The "-o xino=auto" overlay mount option
+enables the "xino" feature only if the persistent st_ino requirement is met.
+
 The following table summarizes what can be expected in different overlay
 configurations.
 
@@ -66,14 +66,13 @@ Inode properties
 | All layers   |  Y  |  Y   |  Y  |  Y   |  Y     |   Y    |  Y     |  Y    |
 | on same fs   |     |      |     |      |        |        |        |       |
 +--------------+-----+------+-----+------+--------+--------+--------+-------+
-| Layers not   |  N  |  Y   |  Y  |  N   |  N     |   Y    |  N     |  Y    |
+| Layers not   |  N  |  N   |  Y  |  N   |  N     |   Y    |  N     |  Y    |
 | on same fs,  |     |      |     |      |        |        |        |       |
 | xino=off     |     |      |     |      |        |        |        |       |
 +--------------+-----+------+-----+------+--------+--------+--------+-------+
 | xino=on/auto |  Y  |  Y   |  Y  |  Y   |  Y     |   Y    |  Y     |  Y    |
-|              |     |      |     |      |        |        |        |       |
 +--------------+-----+------+-----+------+--------+--------+--------+-------+
-| xino=on/auto,|  N  |  Y   |  Y  |  N   |  N     |   Y    |  N     |  Y    |
+| xino=on/auto,|  N  |  N   |  Y  |  N   |  N     |   Y    |  N     |  Y    |
 | ino overflow |     |      |     |      |        |        |        |       |
 +--------------+-----+------+-----+------+--------+--------+--------+-------+
 
@@ -81,7 +80,6 @@ Inode properties
 /proc files, such as /proc/locks and /proc/self/fdinfo/<fd> of an inotify
 file descriptor.
 
-
 Upper and Lower
 ---------------
 
@@ -461,7 +459,7 @@ enough free bits in the inode number, then overlayfs will not be able to
 guarantee that the values of st_ino and st_dev returned by stat(2) and the
 value of d_ino returned by readdir(3) will act like on a normal filesystem.
 E.g. the value of st_dev may be different for two objects in the same
-overlay filesystem and the value of st_ino for directory objects may not be
+overlay filesystem and the value of st_ino for filesystem objects may not be
 persistent and could change even while the overlay filesystem is mounted, as
 summarized in the `Inode properties`_ table above.
 
@@ -476,7 +474,7 @@ a crash or deadlock.
 
 Offline changes, when the overlay is not mounted, are allowed to the
 upper tree.  Offline changes to the lower tree are only allowed if the
-"metadata only copy up", "inode index", and "redirect_dir" features
+"metadata only copy up", "inode index", "xino" and "redirect_dir" features
 have not been used.  If the lower tree is modified and any of these
 features has been used, the behavior of the overlay is undefined,
 though it will not result in a crash or deadlock.
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 48fbfc336ebf..81bfe3c800cc 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -540,7 +540,9 @@ encoded manner. The codes are the following:
     ac    area is accountable
     nr    swap space is not reserved for the area
     ht    area uses huge tlb pages
+    sf    synchronous page fault
     ar    architecture specific flag
+    wf    wipe on fork
     dd    do not include area into core dump
     sd    soft dirty flag
     mm    mixed map area
@@ -549,6 +551,8 @@ encoded manner. The codes are the following:
     mg    mergable advise flag
     bt    arm64 BTI guarded page
     mt    arm64 MTE allocation tags are enabled
+    um    userfaultfd missing tracking
+    uw    userfaultfd wr-protect tracking
     ==    =======================================
 
 Note that there is no guarantee that every flag and associated mnemonic will
diff --git a/Documentation/filesystems/vfat.rst b/Documentation/filesystems/vfat.rst
index e85d74e91295..760a4d83fdf9 100644
--- a/Documentation/filesystems/vfat.rst
+++ b/Documentation/filesystems/vfat.rst
@@ -189,7 +189,7 @@ VFAT MOUNT OPTIONS
 **discard**
 	If set, issues discard/TRIM commands to the block
 	device when blocks are freed. This is useful for SSD devices
-	and sparse/thinly-provisoned LUNs.
+	and sparse/thinly-provisioned LUNs.
 
 **nfs=stale_rw|nostale_ro**
 	Enable this only if you want to export the FAT filesystem
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 2049bbf5e388..14c31eced416 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -441,6 +441,9 @@ As of kernel 2.6.22, the following members are defined:
 				   unsigned open_flag, umode_t create_mode);
 		int (*tmpfile) (struct user_namespace *, struct inode *, struct dentry *, umode_t);
 		int (*set_acl)(struct user_namespace *, struct inode *, struct posix_acl *, int);
+		int (*fileattr_set)(struct user_namespace *mnt_userns,
+				    struct dentry *dentry, struct fileattr *fa);
+		int (*fileattr_get)(struct dentry *dentry, struct fileattr *fa);
 	};
 
 Again, all methods are called without any locks being held, unless
@@ -588,6 +591,18 @@ otherwise noted.
 	atomically creating, opening and unlinking a file in given
 	directory.
 
+``fileattr_get``
+	called on ioctl(FS_IOC_GETFLAGS) and ioctl(FS_IOC_FSGETXATTR) to
+	retrieve miscellaneous file flags and attributes.  Also called
+	before the relevant SET operation to check what is being changed
+	(in this case with i_rwsem locked exclusive).  If unset, then
+	fall back to f_op->ioctl().
+
+``fileattr_set``
+	called on ioctl(FS_IOC_SETFLAGS) and ioctl(FS_IOC_FSSETXATTR) to
+	change miscellaneous file flags and attributes.  Callers hold
+	i_rwsem exclusive.  If unset, then fall back to f_op->ioctl().
+
 The Address Space Object
 ========================
 
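
Reviewer note, not part of the patch: the ``fileattr_get`` text in the vfs.rst
hunk above names the ioctls that reach the new inode operation.  As a rough
userspace illustration, the sketch below issues ``FS_IOC_GETFLAGS`` against a
path and renders a few flag bits as text.  The helper names
``describe_flags()`` and ``get_inode_flags()`` are hypothetical and exist only
for this example; the ioctl number and the ``FS_*_FL`` constants are the ones
exported by ``<linux/fs.h>``.  Whether a given call is served by
``->fileattr_get()`` or by the fallback path depends on the filesystem, and
filesystems without support typically fail with ENOTTY.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>

/* Hypothetical helper: render a few well-known inode flags as text. */
static void describe_flags(unsigned int flags, char *buf, size_t len)
{
	snprintf(buf, len, "%s%s%s",
		 (flags & FS_IMMUTABLE_FL) ? "immutable " : "",
		 (flags & FS_APPEND_FL) ? "append-only " : "",
		 (flags & FS_NODUMP_FL) ? "nodump " : "");
}

/*
 * Hypothetical helper: query the inode flags of @path from userspace.
 * Returns 0 and fills *flags on success, or a negative errno value.
 */
static int get_inode_flags(const char *path, unsigned int *flags)
{
	int fd = open(path, O_RDONLY);
	int raw = 0;	/* the ioctl reads and writes an int */
	int ret = 0;

	if (fd < 0)
		return -errno;

	if (ioctl(fd, FS_IOC_GETFLAGS, &raw) < 0)
		ret = -errno;	/* e.g. -ENOTTY without filesystem support */
	else
		*flags = (unsigned int)raw;

	close(fd);
	return ret;
}
```

Calling ``get_inode_flags()`` on a file and passing the result to
``describe_flags()`` should show a subset of what lsattr(1) reports for the
same file, on filesystems that implement the ioctl.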