diff options
Diffstat (limited to 'Documentation/filesystems')
-rw-r--r-- | Documentation/filesystems/autofs.rst | 2 | ||||
-rw-r--r-- | Documentation/filesystems/cifs/ksmbd.rst | 10 | ||||
-rw-r--r-- | Documentation/filesystems/erofs.rst | 12 | ||||
-rw-r--r-- | Documentation/filesystems/ext4/orphan.rst | 44 | ||||
-rw-r--r-- | Documentation/filesystems/f2fs.rst | 21 | ||||
-rw-r--r-- | Documentation/filesystems/fscrypt.rst | 83 | ||||
-rw-r--r-- | Documentation/filesystems/index.rst | 1 | ||||
-rw-r--r-- | Documentation/filesystems/locks.rst | 17 | ||||
-rw-r--r-- | Documentation/filesystems/netfs_library.rst | 97 | ||||
-rw-r--r-- | Documentation/filesystems/nfs/index.rst | 1 | ||||
-rw-r--r-- | Documentation/filesystems/nfs/reexport.rst | 113 | ||||
-rw-r--r-- | Documentation/filesystems/proc.rst | 26 |
12 files changed, 302 insertions, 125 deletions
diff --git a/Documentation/filesystems/autofs.rst b/Documentation/filesystems/autofs.rst index 681c6a492bc0..4f490278d22f 100644 --- a/Documentation/filesystems/autofs.rst +++ b/Documentation/filesystems/autofs.rst @@ -35,7 +35,7 @@ This document describes only the kernel module and the interactions required with any user-space program. Subsequent text refers to this as the "automount daemon" or simply "the daemon". -"autofs" is a Linux kernel module with provides the "autofs" +"autofs" is a Linux kernel module which provides the "autofs" filesystem type. Several "autofs" filesystems can be mounted and they can each be managed separately, or all managed by the same daemon. diff --git a/Documentation/filesystems/cifs/ksmbd.rst b/Documentation/filesystems/cifs/ksmbd.rst index a1326157d53f..b0d354fd8066 100644 --- a/Documentation/filesystems/cifs/ksmbd.rst +++ b/Documentation/filesystems/cifs/ksmbd.rst @@ -50,11 +50,11 @@ ksmbd.mountd (user space daemon) -------------------------------- ksmbd.mountd is userspace process to, transfer user account and password that -are registered using ksmbd.adduser(part of utils for user space). Further it +are registered using ksmbd.adduser (part of utils for user space). Further it allows sharing information parameters that parsed from smb.conf to ksmbd in kernel. For the execution part it has a daemon which is continuously running and connected to the kernel interface using netlink socket, it waits for the -requests(dcerpc and share/user info). It handles RPC calls (at a minimum few +requests (dcerpc and share/user info). It handles RPC calls (at a minimum few dozen) that are most important for file server from NetShareEnum and NetServerGetInfo. Complete DCE/RPC response is prepared from the user space and passed over to the associated kernel thread for the client. @@ -154,11 +154,11 @@ Each layer 1. Enable all component prints # sudo ksmbd.control -d "all" -2. Enable one of components(smb, auth, vfs, oplock, ipc, conn, rdma) +2. Enable one of components (smb, auth, vfs, oplock, ipc, conn, rdma) # sudo ksmbd.control -d "smb" -3. Show what prints are enable. - # cat/sys/class/ksmbd-control/debug +3. Show what prints are enabled. + # cat /sys/class/ksmbd-control/debug [smb] auth vfs oplock ipc conn [rdma] 4. Disable prints: diff --git a/Documentation/filesystems/erofs.rst b/Documentation/filesystems/erofs.rst index b97579b7d8fb..01df283c7d04 100644 --- a/Documentation/filesystems/erofs.rst +++ b/Documentation/filesystems/erofs.rst @@ -19,9 +19,10 @@ It is designed as a better filesystem solution for the following scenarios: immutable and bit-for-bit identical to the official golden image for their releases due to security and other considerations and - - hope to save some extra storage space with guaranteed end-to-end performance - by using reduced metadata and transparent file compression, especially - for those embedded devices with limited memory (ex, smartphone); + - hope to minimize extra storage space with guaranteed end-to-end performance + by using compact layout, transparent file compression and direct access, + especially for those embedded devices with limited memory and high-density + hosts with numerous containers; Here is the main features of EROFS: @@ -51,7 +52,9 @@ Here is the main features of EROFS: - Support POSIX.1e ACLs by using xattrs; - Support transparent data compression as an option: - LZ4 algorithm with the fixed-sized output compression for high performance. + LZ4 algorithm with the fixed-sized output compression for high performance; + + - Multiple device support for multi-layer container images. The following git tree provides the file system user-space tools under development (ex, formatting tool mkfs.erofs): @@ -87,6 +90,7 @@ cache_strategy=%s Select a strategy for cached decompression from now on: dax={always,never} Use direct access (no page cache). See Documentation/filesystems/dax.rst. dax A legacy option which is an alias for ``dax=always``. +device=%s Specify a path to an extra device to be used together. =================== ========================================================= On-disk details diff --git a/Documentation/filesystems/ext4/orphan.rst b/Documentation/filesystems/ext4/orphan.rst index bb19ecd1b626..03cca178864b 100644 --- a/Documentation/filesystems/ext4/orphan.rst +++ b/Documentation/filesystems/ext4/orphan.rst @@ -12,41 +12,31 @@ track the inode as orphan so that in case of crash extra blocks allocated to the file get truncated. Traditionally ext4 tracks orphan inodes in a form of single linked list where -superblock contains the inode number of the last orphan inode (s\_last\_orphan +superblock contains the inode number of the last orphan inode (s_last_orphan field) and then each inode contains inode number of the previously orphaned -inode (we overload i\_dtime inode field for this). However this filesystem +inode (we overload i_dtime inode field for this). However this filesystem global single linked list is a scalability bottleneck for workloads that result in heavy creation of orphan inodes. When orphan file feature -(COMPAT\_ORPHAN\_FILE) is enabled, the filesystem has a special inode -(referenced from the superblock through s\_orphan_file_inum) with several +(COMPAT_ORPHAN_FILE) is enabled, the filesystem has a special inode +(referenced from the superblock through s_orphan_file_inum) with several blocks. Each of these blocks has a structure: -.. list-table:: - :widths: 8 8 24 40 - :header-rows: 1 - - * - Offset - - Type - - Name - - Description - * - 0x0 - - Array of \_\_le32 entries - - Orphan inode entries - - Each \_\_le32 entry is either empty (0) or it contains inode number of - an orphan inode. - * - blocksize - 8 - - \_\_le32 - - ob\_magic - - Magic value stored in orphan block tail (0x0b10ca04) - * - blocksize - 4 - - \_\_le32 - - ob\_checksum - - Checksum of the orphan block. +============= ================ =============== =============================== +Offset Type Name Description +============= ================ =============== =============================== +0x0 Array of Orphan inode Each __le32 entry is either + __le32 entries entries empty (0) or it contains + inode number of an orphan + inode. +blocksize-8 __le32 ob_magic Magic value stored in orphan + block tail (0x0b10ca04) +blocksize-4 __le32 ob_checksum Checksum of the orphan block. +============= ================ =============== =============================== When a filesystem with orphan file feature is writeably mounted, we set -RO\_COMPAT\_ORPHAN\_PRESENT feature in the superblock to indicate there may +RO_COMPAT_ORPHAN_PRESENT feature in the superblock to indicate there may be valid orphan entries. In case we see this feature when mounting the filesystem, we read the whole orphan file and process all orphan inodes found there as usual. When cleanly unmounting the filesystem we remove the -RO\_COMPAT\_ORPHAN\_PRESENT feature to avoid unnecessary scanning of the orphan +RO_COMPAT_ORPHAN_PRESENT feature to avoid unnecessary scanning of the orphan file and also make the filesystem fully compatible with older kernels. diff --git a/Documentation/filesystems/f2fs.rst b/Documentation/filesystems/f2fs.rst index 09de6ebbbdfa..d7b84695f56a 100644 --- a/Documentation/filesystems/f2fs.rst +++ b/Documentation/filesystems/f2fs.rst @@ -197,10 +197,29 @@ fault_type=%d Support configuring fault injection type, should be FAULT_DISCARD 0x000002000 FAULT_WRITE_IO 0x000004000 FAULT_SLAB_ALLOC 0x000008000 + FAULT_DQUOT_INIT 0x000010000 =================== =========== mode=%s Control block allocation mode which supports "adaptive" and "lfs". In "lfs" mode, there should be no random writes towards main area. + "fragment:segment" and "fragment:block" are newly added here. + These are developer options for experiments to simulate filesystem + fragmentation/after-GC situation itself. The developers use these + modes to understand filesystem fragmentation/after-GC condition well, + and eventually get some insights to handle them better. + In "fragment:segment", f2fs allocates a new segment in ramdom + position. With this, we can simulate the after-GC condition. + In "fragment:block", we can scatter block allocation with + "max_fragment_chunk" and "max_fragment_hole" sysfs nodes. + We added some randomness to both chunk and hole size to make + it close to realistic IO pattern. So, in this mode, f2fs will allocate + 1..<max_fragment_chunk> blocks in a chunk and make a hole in the + length of 1..<max_fragment_hole> by turns. With this, the newly + allocated blocks will be scattered throughout the whole partition. + Note that "fragment:block" implicitly enables "fragment:segment" + option for more randomness. + Please, use these options for your experiments and we strongly + recommend to re-format the filesystem after using these options. io_bits=%u Set the bit size of write IO requests. It should be set with "mode=lfs". usrquota Enable plain user disk quota accounting. @@ -283,7 +302,7 @@ compress_extension=%s Support adding specified extension, so that f2fs can enab For other files, we can still enable compression via ioctl. Note that, there is one reserved special extension '*', it can be set to enable compression for all files. -nocompress_extension=%s Support adding specified extension, so that f2fs can disable +nocompress_extension=%s Support adding specified extension, so that f2fs can disable compression on those corresponding files, just contrary to compression extension. If you know exactly which files cannot be compressed, you can use this. The same extension name can't appear in both compress and nocompress diff --git a/Documentation/filesystems/fscrypt.rst b/Documentation/filesystems/fscrypt.rst index 0eb799d9d05a..4d5d50dca65c 100644 --- a/Documentation/filesystems/fscrypt.rst +++ b/Documentation/filesystems/fscrypt.rst @@ -77,11 +77,11 @@ Side-channel attacks fscrypt is only resistant to side-channel attacks, such as timing or electromagnetic attacks, to the extent that the underlying Linux -Cryptographic API algorithms are. If a vulnerable algorithm is used, -such as a table-based implementation of AES, it may be possible for an -attacker to mount a side channel attack against the online system. -Side channel attacks may also be mounted against applications -consuming decrypted data. +Cryptographic API algorithms or inline encryption hardware are. If a +vulnerable algorithm is used, such as a table-based implementation of +AES, it may be possible for an attacker to mount a side channel attack +against the online system. Side channel attacks may also be mounted +against applications consuming decrypted data. Unauthorized file access ~~~~~~~~~~~~~~~~~~~~~~~~ @@ -176,11 +176,11 @@ Master Keys Each encrypted directory tree is protected by a *master key*. Master keys can be up to 64 bytes long, and must be at least as long as the -greater of the key length needed by the contents and filenames -encryption modes being used. For example, if AES-256-XTS is used for -contents encryption, the master key must be 64 bytes (512 bits). Note -that the XTS mode is defined to require a key twice as long as that -required by the underlying block cipher. +greater of the security strength of the contents and filenames +encryption modes being used. For example, if any AES-256 mode is +used, the master key must be at least 256 bits, i.e. 32 bytes. A +stricter requirement applies if the key is used by a v1 encryption +policy and AES-256-XTS is used; such keys must be 64 bytes. To "unlock" an encrypted directory tree, userspace must provide the appropriate master key. There can be any number of master keys, each @@ -1135,6 +1135,50 @@ where applications may later write sensitive data. It is recommended that systems implementing a form of "verified boot" take advantage of this by validating all top-level encryption policies prior to access. +Inline encryption support +========================= + +By default, fscrypt uses the kernel crypto API for all cryptographic +operations (other than HKDF, which fscrypt partially implements +itself). The kernel crypto API supports hardware crypto accelerators, +but only ones that work in the traditional way where all inputs and +outputs (e.g. plaintexts and ciphertexts) are in memory. fscrypt can +take advantage of such hardware, but the traditional acceleration +model isn't particularly efficient and fscrypt hasn't been optimized +for it. + +Instead, many newer systems (especially mobile SoCs) have *inline +encryption hardware* that can encrypt/decrypt data while it is on its +way to/from the storage device. Linux supports inline encryption +through a set of extensions to the block layer called *blk-crypto*. +blk-crypto allows filesystems to attach encryption contexts to bios +(I/O requests) to specify how the data will be encrypted or decrypted +in-line. For more information about blk-crypto, see +:ref:`Documentation/block/inline-encryption.rst <inline_encryption>`. + +On supported filesystems (currently ext4 and f2fs), fscrypt can use +blk-crypto instead of the kernel crypto API to encrypt/decrypt file +contents. To enable this, set CONFIG_FS_ENCRYPTION_INLINE_CRYPT=y in +the kernel configuration, and specify the "inlinecrypt" mount option +when mounting the filesystem. + +Note that the "inlinecrypt" mount option just specifies to use inline +encryption when possible; it doesn't force its use. fscrypt will +still fall back to using the kernel crypto API on files where the +inline encryption hardware doesn't have the needed crypto capabilities +(e.g. support for the needed encryption algorithm and data unit size) +and where blk-crypto-fallback is unusable. (For blk-crypto-fallback +to be usable, it must be enabled in the kernel configuration with +CONFIG_BLK_INLINE_ENCRYPTION_FALLBACK=y.) + +Currently fscrypt always uses the filesystem block size (which is +usually 4096 bytes) as the data unit size. Therefore, it can only use +inline encryption hardware that supports that data unit size. + +Inline encryption doesn't affect the ciphertext or other aspects of +the on-disk format, so users may freely switch back and forth between +using "inlinecrypt" and not using "inlinecrypt". + Implementation details ====================== @@ -1184,6 +1228,13 @@ keys`_ and `DIRECT_KEY policies`_. Data path changes ----------------- +When inline encryption is used, filesystems just need to associate +encryption contexts with bios to specify how the block layer or the +inline encryption hardware will encrypt/decrypt the file contents. + +When inline encryption isn't used, filesystems must encrypt/decrypt +the file contents themselves, as described below: + For the read path (->readpage()) of regular files, filesystems can read the ciphertext into the page cache and decrypt it in-place. The page lock must be held until decryption has finished, to prevent the @@ -1197,18 +1248,6 @@ buffer. Some filesystems, such as UBIFS, already use temporary buffers regardless of encryption. Other filesystems, such as ext4 and F2FS, have to allocate bounce pages specially for encryption. -Fscrypt is also able to use inline encryption hardware instead of the -kernel crypto API for en/decryption of file contents. When possible, -and if directed to do so (by specifying the 'inlinecrypt' mount option -for an ext4/F2FS filesystem), it adds encryption contexts to bios and -uses blk-crypto to perform the en/decryption instead of making use of -the above read/write path changes. Of course, even if directed to -make use of inline encryption, fscrypt will only be able to do so if -either hardware inline encryption support is available for the -selected encryption algorithm or CONFIG_BLK_INLINE_ENCRYPTION_FALLBACK -is selected. If neither is the case, fscrypt will fall back to using -the above mentioned read/write path changes for en/decryption. - Filename hashing and encoding ----------------------------- diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index c0ad233963ae..bee63d42e5ec 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -29,7 +29,6 @@ algorithms work. fiemap files locks - mandatory-locking mount_api quota seq_file diff --git a/Documentation/filesystems/locks.rst b/Documentation/filesystems/locks.rst index c5ae858b1aac..26429317dbbc 100644 --- a/Documentation/filesystems/locks.rst +++ b/Documentation/filesystems/locks.rst @@ -57,16 +57,9 @@ fcntl(), with all the problems that implies. 1.3 Mandatory Locking As A Mount Option --------------------------------------- -Mandatory locking, as described in -'Documentation/filesystems/mandatory-locking.rst' was prior to this release a -general configuration option that was valid for all mounted filesystems. This -had a number of inherent dangers, not the least of which was the ability to -freeze an NFS server by asking it to read a file for which a mandatory lock -existed. - -From this release of the kernel, mandatory locking can be turned on and off -on a per-filesystem basis, using the mount options 'mand' and 'nomand'. -The default is to disallow mandatory locking. The intention is that -mandatory locking only be enabled on a local filesystem as the specific need -arises. +Mandatory locking was prior to this release a general configuration option +that was valid for all mounted filesystems. This had a number of inherent +dangers, not the least of which was the ability to freeze an NFS server by +asking it to read a file for which a mandatory lock existed. +Such option was dropped in Kernel v5.14. diff --git a/Documentation/filesystems/netfs_library.rst b/Documentation/filesystems/netfs_library.rst index 57a641847818..375baca7edcd 100644 --- a/Documentation/filesystems/netfs_library.rst +++ b/Documentation/filesystems/netfs_library.rst @@ -1,7 +1,7 @@ .. SPDX-License-Identifier: GPL-2.0 ================================= -NETWORK FILESYSTEM HELPER LIBRARY +Network Filesystem Helper Library ================================= .. Contents: @@ -37,22 +37,22 @@ into a common call framework. The following services are provided: - * Handles transparent huge pages (THPs). + * Handle folios that span multiple pages. - * Insulates the netfs from VM interface changes. + * Insulate the netfs from VM interface changes. - * Allows the netfs to arbitrarily split reads up into pieces, even ones that - don't match page sizes or page alignments and that may cross pages. + * Allow the netfs to arbitrarily split reads up into pieces, even ones that + don't match folio sizes or folio alignments and that may cross folios. - * Allows the netfs to expand a readahead request in both directions to meet - its needs. + * Allow the netfs to expand a readahead request in both directions to meet its + needs. - * Allows the netfs to partially fulfil a read, which will then be resubmitted. + * Allow the netfs to partially fulfil a read, which will then be resubmitted. - * Handles local caching, allowing cached data and server-read data to be + * Handle local caching, allowing cached data and server-read data to be interleaved for a single request. - * Handles clearing of bufferage that aren't on the server. + * Handle clearing of bufferage that aren't on the server. * Handle retrying of reads that failed, switching reads from the cache to the server as necessary. @@ -70,22 +70,22 @@ Read Helper Functions Three read helpers are provided:: - * void netfs_readahead(struct readahead_control *ractl, - const struct netfs_read_request_ops *ops, - void *netfs_priv);`` - * int netfs_readpage(struct file *file, - struct page *page, - const struct netfs_read_request_ops *ops, - void *netfs_priv); - * int netfs_write_begin(struct file *file, - struct address_space *mapping, - loff_t pos, - unsigned int len, - unsigned int flags, - struct page **_page, - void **_fsdata, - const struct netfs_read_request_ops *ops, - void *netfs_priv); + void netfs_readahead(struct readahead_control *ractl, + const struct netfs_read_request_ops *ops, + void *netfs_priv); + int netfs_readpage(struct file *file, + struct folio *folio, + const struct netfs_read_request_ops *ops, + void *netfs_priv); + int netfs_write_begin(struct file *file, + struct address_space *mapping, + loff_t pos, + unsigned int len, + unsigned int flags, + struct folio **_folio, + void **_fsdata, + const struct netfs_read_request_ops *ops, + void *netfs_priv); Each corresponds to a VM operation, with the addition of a couple of parameters for the use of the read helpers: @@ -103,8 +103,8 @@ Both of these values will be stored into the read request structure. For ->readahead() and ->readpage(), the network filesystem should just jump into the corresponding read helper; whereas for ->write_begin(), it may be a little more complicated as the network filesystem might want to flush -conflicting writes or track dirty data and needs to put the acquired page if an -error occurs after calling the helper. +conflicting writes or track dirty data and needs to put the acquired folio if +an error occurs after calling the helper. The helpers manage the read request, calling back into the network filesystem through the suppplied table of operations. Waits will be performed as @@ -253,7 +253,7 @@ through which it can issue requests and negotiate:: void (*issue_op)(struct netfs_read_subrequest *subreq); bool (*is_still_valid)(struct netfs_read_request *rreq); int (*check_write_begin)(struct file *file, loff_t pos, unsigned len, - struct page *page, void **_fsdata); + struct folio *folio, void **_fsdata); void (*done)(struct netfs_read_request *rreq); void (*cleanup)(struct address_space *mapping, void *netfs_priv); }; @@ -313,13 +313,14 @@ The operations are as follows: There is no return value; the netfs_subreq_terminated() function should be called to indicate whether or not the operation succeeded and how much data - it transferred. The filesystem also should not deal with setting pages + it transferred. The filesystem also should not deal with setting folios uptodate, unlocking them or dropping their refs - the helpers need to deal with this as they have to coordinate with copying to the local cache. - Note that the helpers have the pages locked, but not pinned. It is possible - to use the ITER_XARRAY iov iterator to refer to the range of the inode that - is being operated upon without the need to allocate large bvec tables. + Note that the helpers have the folios locked, but not pinned. It is + possible to use the ITER_XARRAY iov iterator to refer to the range of the + inode that is being operated upon without the need to allocate large bvec + tables. * ``is_still_valid()`` @@ -330,15 +331,15 @@ The operations are as follows: * ``check_write_begin()`` [Optional] This is called from the netfs_write_begin() helper once it has - allocated/grabbed the page to be modified to allow the filesystem to flush + allocated/grabbed the folio to be modified to allow the filesystem to flush conflicting state before allowing it to be modified. - It should return 0 if everything is now fine, -EAGAIN if the page should be + It should return 0 if everything is now fine, -EAGAIN if the folio should be regrabbed and any other error code to abort the operation. * ``done`` - [Optional] This is called after the pages in the request have all been + [Optional] This is called after the folios in the request have all been unlocked (and marked uptodate if applicable). * ``cleanup`` @@ -390,7 +391,7 @@ The read helpers work by the following general procedure: * If NETFS_SREQ_CLEAR_TAIL was set, a short read will be cleared to the end of the slice instead of reissuing. - * Once the data is read, the pages that have been fully read/cleared: + * Once the data is read, the folios that have been fully read/cleared: * Will be marked uptodate. @@ -398,11 +399,11 @@ The read helpers work by the following general procedure: * Unlocked - * Any pages that need writing to the cache will then have DIO writes issued. + * Any folios that need writing to the cache will then have DIO writes issued. * Synchronous operations will wait for reading to be complete. - * Writes to the cache will proceed asynchronously and the pages will have the + * Writes to the cache will proceed asynchronously and the folios will have the PG_fscache mark removed when that completes. * The request structures will be cleaned up when everything has completed. @@ -452,6 +453,9 @@ operation table looks like the following:: netfs_io_terminated_t term_func, void *term_func_priv); + int (*prepare_write)(struct netfs_cache_resources *cres, + loff_t *_start, size_t *_len, loff_t i_size); + int (*write)(struct netfs_cache_resources *cres, loff_t start_pos, struct iov_iter *iter, @@ -509,6 +513,14 @@ The methods defined in the table are: indicating whether the termination is definitely happening in the caller's context. + * ``prepare_write()`` + + [Required] Called to adjust a write to the cache and check that there is + sufficient space in the cache. The start and length values indicate the + size of the write that netfslib is proposing, and this can be adjusted by + the cache to respect DIO boundaries. The file size is passed for + information. + * ``write()`` [Required] Called to write to the cache. The start file offset is given @@ -524,3 +536,10 @@ Note that these methods are passed a pointer to the cache resource structure, not the read request structure as they could be used in other situations where there isn't a read request structure as well, such as writing dirty data to the cache. + + +API Function Reference +====================== + +.. kernel-doc:: include/linux/netfs.h +.. kernel-doc:: fs/netfs/read_helper.c diff --git a/Documentation/filesystems/nfs/index.rst b/Documentation/filesystems/nfs/index.rst index 65805624e39b..288d8ddb2bc6 100644 --- a/Documentation/filesystems/nfs/index.rst +++ b/Documentation/filesystems/nfs/index.rst @@ -11,3 +11,4 @@ NFS rpc-server-gss nfs41-server knfsd-stats + reexport diff --git a/Documentation/filesystems/nfs/reexport.rst b/Documentation/filesystems/nfs/reexport.rst new file mode 100644 index 000000000000..ff9ae4a46530 --- /dev/null +++ b/Documentation/filesystems/nfs/reexport.rst @@ -0,0 +1,113 @@ +Reexporting NFS filesystems +=========================== + +Overview +-------- + +It is possible to reexport an NFS filesystem over NFS. However, this +feature comes with a number of limitations. Before trying it, we +recommend some careful research to determine whether it will work for +your purposes. + +A discussion of current known limitations follows. + +"fsid=" required, crossmnt broken +--------------------------------- + +We require the "fsid=" export option on any reexport of an NFS +filesystem. You can use "uuidgen -r" to generate a unique argument. + +The "crossmnt" export does not propagate "fsid=", so it will not allow +traversing into further nfs filesystems; if you wish to export nfs +filesystems mounted under the exported filesystem, you'll need to export +them explicitly, assigning each its own unique "fsid= option. + +Reboot recovery +--------------- + +The NFS protocol's normal reboot recovery mechanisms don't work for the +case when the reexport server reboots. Clients will lose any locks +they held before the reboot, and further IO will result in errors. +Closing and reopening files should clear the errors. + +Filehandle limits +----------------- + +If the original server uses an X byte filehandle for a given object, the +reexport server's filehandle for the reexported object will be X+22 +bytes, rounded up to the nearest multiple of four bytes. + +The result must fit into the RFC-mandated filehandle size limits: + ++-------+-----------+ +| NFSv2 | 32 bytes | ++-------+-----------+ +| NFSv3 | 64 bytes | ++-------+-----------+ +| NFSv4 | 128 bytes | ++-------+-----------+ + +So, for example, you will only be able to reexport a filesystem over +NFSv2 if the original server gives you filehandles that fit in 10 +bytes--which is unlikely. + +In general there's no way to know the maximum filehandle size given out +by an NFS server without asking the server vendor. + +But the following table gives a few examples. The first column is the +typical length of the filehandle from a Linux server exporting the given +filesystem, the second is the length after that nfs export is reexported +by another Linux host: + ++--------+-------------------+----------------+ +| | filehandle length | after reexport | ++========+===================+================+ +| ext4: | 28 bytes | 52 bytes | ++--------+-------------------+----------------+ +| xfs: | 32 bytes | 56 bytes | ++--------+-------------------+----------------+ +| btrfs: | 40 bytes | 64 bytes | ++--------+-------------------+----------------+ + +All will therefore fit in an NFSv3 or NFSv4 filehandle after reexport, +but none are reexportable over NFSv2. + +Linux server filehandles are a bit more complicated than this, though; +for example: + + - The (non-default) "subtreecheck" export option generally + requires another 4 to 8 bytes in the filehandle. + - If you export a subdirectory of a filesystem (instead of + exporting the filesystem root), that also usually adds 4 to 8 + bytes. + - If you export over NFSv2, knfsd usually uses a shorter + filesystem identifier that saves 8 bytes. + - The root directory of an export uses a filehandle that is + shorter. + +As you can see, the 128-byte NFSv4 filehandle is large enough that +you're unlikely to have trouble using NFSv4 to reexport any filesystem +exported from a Linux server. In general, if the original server is +something that also supports NFSv3, you're *probably* OK. Re-exporting +over NFSv3 may be dicier, and reexporting over NFSv2 will probably +never work. + +For more details of Linux filehandle structure, the best reference is +the source code and comments; see in particular: + + - include/linux/exportfs.h:enum fid_type + - include/uapi/linux/nfsd/nfsfh.h:struct nfs_fhbase_new + - fs/nfsd/nfsfh.c:set_version_and_fsid_type + - fs/nfs/export.c:nfs_encode_fh + +Open DENY bits ignored +---------------------- + +NFS since NFSv4 supports ALLOW and DENY bits taken from Windows, which +allow you, for example, to open a file in a mode which forbids other +read opens or write opens. The Linux client doesn't use them, and the +server's support has always been incomplete: they are enforced only +against other NFS users, not against processes accessing the exported +filesystem locally. A reexport server will also not pass them along to +the original server, so they will not be enforced between clients of +different reexport servers. diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index 042c418f4090..8d7f141c6fc7 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -1857,19 +1857,19 @@ For example:: This file contains lines of the form:: 36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - ext3 /dev/root rw,errors=continue - (1)(2)(3) (4) (5) (6) (7) (8) (9) (10) (11) - - (1) mount ID: unique identifier of the mount (may be reused after umount) - (2) parent ID: ID of parent (or of self for the top of the mount tree) - (3) major:minor: value of st_dev for files on filesystem - (4) root: root of the mount within the filesystem - (5) mount point: mount point relative to the process's root - (6) mount options: per mount options - (7) optional fields: zero or more fields of the form "tag[:value]" - (8) separator: marks the end of the optional fields - (9) filesystem type: name of filesystem of the form "type[.subtype]" - (10) mount source: filesystem specific information or "none" - (11) super options: per super block options + (1)(2)(3) (4) (5) (6) (n…m) (m+1)(m+2) (m+3) (m+4) + + (1) mount ID: unique identifier of the mount (may be reused after umount) + (2) parent ID: ID of parent (or of self for the top of the mount tree) + (3) major:minor: value of st_dev for files on filesystem + (4) root: root of the mount within the filesystem + (5) mount point: mount point relative to the process's root + (6) mount options: per mount options + (n…m) optional fields: zero or more fields of the form "tag[:value]" + (m+1) separator: marks the end of the optional fields + (m+2) filesystem type: name of filesystem of the form "type[.subtype]" + (m+3) mount source: filesystem specific information or "none" + (m+4) super options: per super block options Parsers should ignore all unrecognised optional fields. Currently the possible optional fields are: |