Diffstat (limited to 'Documentation/filesystems')
-rw-r--r-- | Documentation/filesystems/api-summary.rst | 150
-rw-r--r-- | Documentation/filesystems/binderfs.rst | 68
-rw-r--r-- | Documentation/filesystems/exofs.txt | 185
-rw-r--r-- | Documentation/filesystems/fscrypt.rst | 16
-rw-r--r-- | Documentation/filesystems/index.rst | 389
-rw-r--r-- | Documentation/filesystems/journalling.rst | 184
-rw-r--r-- | Documentation/filesystems/mount_api.txt | 709
-rw-r--r-- | Documentation/filesystems/path-lookup.rst | 39
-rw-r--r-- | Documentation/filesystems/splice.rst | 22
-rw-r--r-- | Documentation/filesystems/sysfs.txt | 21
-rw-r--r-- | Documentation/filesystems/vfs.txt | 3
-rw-r--r-- | Documentation/filesystems/xfs.txt | 3
12 files changed, 1223 insertions, 566 deletions
diff --git a/Documentation/filesystems/api-summary.rst b/Documentation/filesystems/api-summary.rst new file mode 100644 index 000000000000..aa51ffcfa029 --- /dev/null +++ b/Documentation/filesystems/api-summary.rst @@ -0,0 +1,150 @@ +============================= +Linux Filesystems API summary +============================= + +This section contains API-level documentation, mostly taken from the source +code itself. + +The Linux VFS +============= + +The Filesystem types +-------------------- + +.. kernel-doc:: include/linux/fs.h + :internal: + +The Directory Cache +------------------- + +.. kernel-doc:: fs/dcache.c + :export: + +.. kernel-doc:: include/linux/dcache.h + :internal: + +Inode Handling +-------------- + +.. kernel-doc:: fs/inode.c + :export: + +.. kernel-doc:: fs/bad_inode.c + :export: + +Registration and Superblocks +---------------------------- + +.. kernel-doc:: fs/super.c + :export: + +File Locks +---------- + +.. kernel-doc:: fs/locks.c + :export: + +.. kernel-doc:: fs/locks.c + :internal: + +Other Functions +--------------- + +.. kernel-doc:: fs/mpage.c + :export: + +.. kernel-doc:: fs/namei.c + :export: + +.. kernel-doc:: fs/buffer.c + :export: + +.. kernel-doc:: block/bio.c + :export: + +.. kernel-doc:: fs/seq_file.c + :export: + +.. kernel-doc:: fs/filesystems.c + :export: + +.. kernel-doc:: fs/fs-writeback.c + :export: + +.. kernel-doc:: fs/block_dev.c + :export: + +.. kernel-doc:: fs/anon_inodes.c + :export: + +.. kernel-doc:: fs/attr.c + :export: + +.. kernel-doc:: fs/d_path.c + :export: + +.. kernel-doc:: fs/dax.c + :export: + +.. kernel-doc:: fs/direct-io.c + :export: + +.. kernel-doc:: fs/file_table.c + :export: + +.. kernel-doc:: fs/libfs.c + :export: + +.. kernel-doc:: fs/posix_acl.c + :export: + +.. kernel-doc:: fs/stat.c + :export: + +.. kernel-doc:: fs/sync.c + :export: + +.. kernel-doc:: fs/xattr.c + :export: + +The proc filesystem +=================== + +sysctl interface +---------------- + +.. 
kernel-doc:: kernel/sysctl.c + :export: + +proc filesystem interface +------------------------- + +.. kernel-doc:: fs/proc/base.c + :internal: + +Events based on file descriptors +================================ + +.. kernel-doc:: fs/eventfd.c + :export: + +The Filesystem for Exporting Kernel Objects +=========================================== + +.. kernel-doc:: fs/sysfs/file.c + :export: + +.. kernel-doc:: fs/sysfs/symlink.c + :export: + +The debugfs filesystem +====================== + +debugfs interface +----------------- + +.. kernel-doc:: fs/debugfs/inode.c + :export: + +.. kernel-doc:: fs/debugfs/file.c + :export: diff --git a/Documentation/filesystems/binderfs.rst b/Documentation/filesystems/binderfs.rst new file mode 100644 index 000000000000..c009671f8434 --- /dev/null +++ b/Documentation/filesystems/binderfs.rst @@ -0,0 +1,68 @@ +.. SPDX-License-Identifier: GPL-2.0 + +The Android binderfs Filesystem +=============================== + +Android binderfs is a filesystem for the Android binder IPC mechanism. It +allows to dynamically add and remove binder devices at runtime. Binder devices +located in a new binderfs instance are independent of binder devices located in +other binderfs instances. Mounting a new binderfs instance makes it possible +to get a set of private binder devices. + +Mounting binderfs +----------------- + +Android binderfs can be mounted with:: + + mkdir /dev/binderfs + mount -t binder binder /dev/binderfs + +at which point a new instance of binderfs will show up at ``/dev/binderfs``. +In a fresh instance of binderfs no binder devices will be present. There will +only be a ``binder-control`` device which serves as the request handler for +binderfs. Mounting another binderfs instance at a different location will +create a new and separate instance from all other binderfs mounts. This is +identical to the behavior of e.g. ``devpts`` and ``tmpfs``. The Android +binderfs filesystem can be mounted in user namespaces. 
+ +Options +------- +max + binderfs instances can be mounted with a limit on the number of binder + devices that can be allocated. The ``max=<count>`` mount option serves as + a per-instance limit. If ``max=<count>`` is set then only ``<count>`` number + of binder devices can be allocated in this binderfs instance. + +Allocating binder Devices +------------------------- + +.. _ioctl: http://man7.org/linux/man-pages/man2/ioctl.2.html + +To allocate a new binder device in a binderfs instance a request needs to be +sent through the ``binder-control`` device node. A request is sent in the form +of an `ioctl() <ioctl_>`_. + +What a program needs to do is to open the ``binder-control`` device node and +send a ``BINDER_CTL_ADD`` request to the kernel. Users of binderfs need to +tell the kernel which name the new binder device should get. By default a name +can only contain up to ``BINDERFS_MAX_NAME`` chars including the terminating +zero byte. + +Once the request is made via an `ioctl() <ioctl_>`_ passing a ``struct +binder_device`` with the name to the kernel it will allocate a new binder +device and return the major and minor number of the new device in the struct +(This is necessary because binderfs allocates a major device number +dynamically.). After the `ioctl() <ioctl_>`_ returns there will be a new +binder device located under /dev/binderfs with the chosen name. + +Deleting binder Devices +----------------------- + +.. _unlink: http://man7.org/linux/man-pages/man2/unlink.2.html +.. _rm: http://man7.org/linux/man-pages/man1/rm.1.html + +Binderfs binder devices can be deleted via `unlink() <unlink_>`_. This means +that the `rm() <rm_>`_ tool can be used to delete them. Note that the +``binder-control`` device cannot be deleted since this would make the binderfs +instance unuseable. The ``binder-control`` device will be deleted when the +binderfs instance is unmounted and all references to it have been dropped. 
diff --git a/Documentation/filesystems/exofs.txt b/Documentation/filesystems/exofs.txt deleted file mode 100644 index 23583a136975..000000000000 --- a/Documentation/filesystems/exofs.txt +++ /dev/null @@ -1,185 +0,0 @@ -=============================================================================== -WHAT IS EXOFS? -=============================================================================== - -exofs is a file system that uses an OSD and exports the API of a normal Linux -file system. Users access exofs like any other local file system, and exofs -will in turn issue commands to the local OSD initiator. - -OSD is a new T10 command set that views storage devices not as a large/flat -array of sectors but as a container of objects, each having a length, quota, -time attributes and more. Each object is addressed by a 64bit ID, and is -contained in a 64bit ID partition. Each object has associated attributes -attached to it, which are integral part of the object and provide metadata about -the object. The standard defines some common obligatory attributes, but user -attributes can be added as needed. - -=============================================================================== -ENVIRONMENT -=============================================================================== - -To use this file system, you need to have an object store to run it on. You -may download a target from: -http://open-osd.org - -See Documentation/scsi/osd.txt for how to setup a working osd environment. - -=============================================================================== -USAGE -=============================================================================== - -1. Download and compile exofs and open-osd initiator: - You need an external Kernel source tree or kernel headers from your - distribution. (anything based on 2.6.26 or later). - - a. download open-osd including exofs source using: - [parent-directory]$ git clone git://git.open-osd.org/open-osd.git - - b. 
Build the library module like this: - [parent-directory]$ make -C KSRC=$(KER_DIR) open-osd - - This will build both the open-osd initiator as well as the exofs kernel - module. Use whatever parameters you compiled your Kernel with and - $(KER_DIR) above pointing to the Kernel you compile against. See the file - open-osd/top-level-Makefile for an example. - -2. Get the OSD initiator and target set up properly, and login to the target. - See Documentation/scsi/osd.txt for farther instructions. Also see ./do-osd - for example script that does all these steps. - -3. Insmod the exofs.ko module: - [exofs]$ insmod exofs.ko - -4. Make sure the directory where you want to mount exists. If not, create it. - (For example, mkdir /mnt/exofs) - -5. At first run you will need to invoke the mkfs.exofs application - - As an example, this will create the file system on: - /dev/osd0 partition ID 65536 - - mkfs.exofs --pid=65536 --format /dev/osd0 - - The --format is optional. If not specified, no OSD_FORMAT will be - performed and a clean file system will be created in the specified pid, - in the available space of the target. (Use --format=size_in_meg to limit - the total LUN space available) - - If pid already exists, it will be deleted and a new one will be created in - its place. Be careful. - - An exofs lives inside a single OSD partition. You can create multiple exofs - filesystems on the same device using multiple pids. - - (run mkfs.exofs without any parameters for usage help message) - -6. Mount the file system. - - For example, to mount /dev/osd0, partition ID 0x10000 on /mnt/exofs: - - mount -t exofs -o pid=65536 /dev/osd0 /mnt/exofs/ - -7. For reference (See do-exofs example script): - do-exofs start - an example of how to perform the above steps. - do-exofs stop - an example of how to unmount the file system. - do-exofs format - an example of how to format and mkfs a new exofs. - -8. 
Extra compilation flags (uncomment in fs/exofs/Kbuild): - CONFIG_EXOFS_DEBUG - for debug messages and extra checks. - -=============================================================================== -exofs mount options -=============================================================================== -Similar to any mount command: - mount -t exofs -o exofs_options /dev/osdX mount_exofs_directory - -Where: - -t exofs: specifies the exofs file system - - /dev/osdX: X is a decimal number. /dev/osdX was created after a successful - login into an OSD target. - - mount_exofs_directory: The directory to mount the file system on - - exofs specific options: Options are separated by commas (,) - pid=<integer> - The partition number to mount/create as - container of the filesystem. - This option is mandatory. integer can be - Hex by pre-pending an 0x to the number. - osdname=<id> - Mount by a device's osdname. - osdname is usually a 36 character uuid of the - form "d2683732-c906-4ee1-9dbd-c10c27bb40df". - It is one of the device's uuid specified in the - mkfs.exofs format command. - If this option is specified then the /dev/osdX - above can be empty and is ignored. - to=<integer> - Timeout in ticks for a single command. - default is (60 * HZ) [for debugging only] - -=============================================================================== -DESIGN -=============================================================================== - -* The file system control block (AKA on-disk superblock) resides in an object - with a special ID (defined in common.h). - Information included in the file system control block is used to fill the - in-memory superblock structure at mount time. This object is created before - the file system is used by mkexofs.c. 
It contains information such as: - - The file system's magic number - - The next inode number to be allocated - -* Each file resides in its own object and contains the data (and it will be - possible to extend the file over multiple objects, though this has not been - implemented yet). - -* A directory is treated as a file, and essentially contains a list of <file - name, inode #> pairs for files that are found in that directory. The object - IDs correspond to the files' inode numbers and will be allocated according to - a bitmap (stored in a separate object). Now they are allocated using a - counter. - -* Each file's control block (AKA on-disk inode) is stored in its object's - attributes. This applies to both regular files and other types (directories, - device files, symlinks, etc.). - -* Credentials are generated per object (inode and superblock) when they are - created in memory (read from disk or created). The credential works for all - operations and is used as long as the object remains in memory. - -* Async OSD operations are used whenever possible, but the target may execute - them out of order. The operations that concern us are create, delete, - readpage, writepage, update_inode, and truncate. The following pairs of - operations should execute in the order written, and we need to prevent them - from executing in reverse order: - - The following are handled with the OBJ_CREATED and OBJ_2BCREATED - flags. OBJ_CREATED is set when we know the object exists on the OSD - - in create's callback function, and when we successfully do a - read_inode. - OBJ_2BCREATED is set in the beginning of the create function, so we - know that we should wait. - - create/delete: delete should wait until the object is created - on the OSD. - - create/readpage: readpage should be able to return a page - full of zeroes in this case. If there was a write already - en-route (i.e. 
create, writepage, readpage) then the page - would be locked, and so it would really be the same as - create/writepage. - - create/writepage: if writepage is called for a sync write, it - should wait until the object is created on the OSD. - Otherwise, it should just return. - - create/truncate: truncate should wait until the object is - created on the OSD. - - create/update_inode: update_inode should wait until the - object is created on the OSD. - - Handled by VFS locks: - - readpage/delete: shouldn't happen because of page lock. - - writepage/delete: shouldn't happen because of page lock. - - readpage/writepage: shouldn't happen because of page lock. - -=============================================================================== -LICENSE/COPYRIGHT -=============================================================================== -The exofs file system is based on ext2 v0.5b (distributed with the Linux kernel -version 2.6.10). All files include the original copyrights, and the license -is GPL version 2 (only version 2, as is true for the Linux kernel). The -Linux kernel can be downloaded from www.kernel.org. diff --git a/Documentation/filesystems/fscrypt.rst b/Documentation/filesystems/fscrypt.rst index 3a7b60521b94..08c23b60e016 100644 --- a/Documentation/filesystems/fscrypt.rst +++ b/Documentation/filesystems/fscrypt.rst @@ -343,9 +343,9 @@ FS_IOC_SET_ENCRYPTION_POLICY can fail with the following errors: - ``ENOTEMPTY``: the file is unencrypted and is a nonempty directory - ``ENOTTY``: this type of filesystem does not implement encryption - ``EOPNOTSUPP``: the kernel was not configured with encryption - support for this filesystem, or the filesystem superblock has not + support for filesystems, or the filesystem superblock has not had encryption enabled on it. 
(For example, to use encryption on an - ext4 filesystem, CONFIG_EXT4_ENCRYPTION must be enabled in the + ext4 filesystem, CONFIG_FS_ENCRYPTION must be enabled in the kernel config, and the superblock must have had the "encrypt" feature flag enabled using ``tune2fs -O encrypt`` or ``mkfs.ext4 -O encrypt``.) @@ -451,10 +451,18 @@ astute users may notice some differences in behavior: - Unencrypted files, or files encrypted with a different encryption policy (i.e. different key, modes, or flags), cannot be renamed or linked into an encrypted directory; see `Encryption policy - enforcement`_. Attempts to do so will fail with EPERM. However, + enforcement`_. Attempts to do so will fail with EXDEV. However, encrypted files can be renamed within an encrypted directory, or into an unencrypted directory. + Note: "moving" an unencrypted file into an encrypted directory, e.g. + with the `mv` program, is implemented in userspace by a copy + followed by a delete. Be aware that the original unencrypted data + may remain recoverable from free space on the disk; prefer to keep + all files encrypted from the very beginning. The `shred` program + may be used to overwrite the source files but isn't guaranteed to be + effective on all filesystems and storage devices. + - Direct I/O is not supported on encrypted files. Attempts to use direct I/O on such files will fall back to buffered I/O. @@ -541,7 +549,7 @@ not be encrypted. Except for those special files, it is forbidden to have unencrypted files, or files encrypted with a different encryption policy, in an encrypted directory tree. Attempts to link or rename such a file into -an encrypted directory will fail with EPERM. This is also enforced +an encrypted directory will fail with EXDEV. This is also enforced during ->lookup() to provide limited protection against offline attacks that try to disable or downgrade encryption in known locations where applications may later write sensitive data. 
It is recommended diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 605befab300b..1131c34d77f6 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -1,382 +1,43 @@ -===================== -Linux Filesystems API -===================== +=============================== +Filesystems in the Linux kernel +=============================== -The Linux VFS -============= +This under-development manual will, some glorious day, provide +comprehensive information on how the Linux virtual filesystem (VFS) layer +works, along with the filesystems that sit below it. For now, what we have +can be found below. -The Filesystem types --------------------- - -.. kernel-doc:: include/linux/fs.h - :internal: - -The Directory Cache -------------------- - -.. kernel-doc:: fs/dcache.c - :export: - -.. kernel-doc:: include/linux/dcache.h - :internal: - -Inode Handling --------------- - -.. kernel-doc:: fs/inode.c - :export: - -.. kernel-doc:: fs/bad_inode.c - :export: - -Registration and Superblocks ----------------------------- - -.. kernel-doc:: fs/super.c - :export: - -File Locks ----------- - -.. kernel-doc:: fs/locks.c - :export: - -.. kernel-doc:: fs/locks.c - :internal: - -Other Functions ---------------- - -.. kernel-doc:: fs/mpage.c - :export: - -.. kernel-doc:: fs/namei.c - :export: - -.. kernel-doc:: fs/buffer.c - :export: - -.. kernel-doc:: block/bio.c - :export: - -.. kernel-doc:: fs/seq_file.c - :export: - -.. kernel-doc:: fs/filesystems.c - :export: - -.. kernel-doc:: fs/fs-writeback.c - :export: - -.. kernel-doc:: fs/block_dev.c - :export: - -.. kernel-doc:: fs/anon_inodes.c - :export: - -.. kernel-doc:: fs/attr.c - :export: - -.. kernel-doc:: fs/d_path.c - :export: - -.. kernel-doc:: fs/dax.c - :export: - -.. kernel-doc:: fs/direct-io.c - :export: - -.. kernel-doc:: fs/file_table.c - :export: - -.. kernel-doc:: fs/libfs.c - :export: - -.. kernel-doc:: fs/posix_acl.c - :export: - -.. 
kernel-doc:: fs/stat.c - :export: - -.. kernel-doc:: fs/sync.c - :export: - -.. kernel-doc:: fs/xattr.c - :export: - -The proc filesystem -=================== - -sysctl interface ----------------- - -.. kernel-doc:: kernel/sysctl.c - :export: - -proc filesystem interface -------------------------- - -.. kernel-doc:: fs/proc/base.c - :internal: - -Events based on file descriptors -================================ - -.. kernel-doc:: fs/eventfd.c - :export: - -The Filesystem for Exporting Kernel Objects -=========================================== - -.. kernel-doc:: fs/sysfs/file.c - :export: - -.. kernel-doc:: fs/sysfs/symlink.c - :export: - -The debugfs filesystem +Core VFS documentation ====================== -debugfs interface ------------------ +See these manuals for documentation about the VFS layer itself and how its +algorithms work. -.. kernel-doc:: fs/debugfs/inode.c - :export: +.. toctree:: + :maxdepth: 2 -.. kernel-doc:: fs/debugfs/file.c - :export: + path-lookup.rst + api-summary + splice -The Linux Journalling API +Filesystem support layers ========================= -Overview --------- - -Details -~~~~~~~ - -The journalling layer is easy to use. You need to first of all create a -journal_t data structure. There are two calls to do this dependent on -how you decide to allocate the physical media on which the journal -resides. The :c:func:`jbd2_journal_init_inode` call is for journals stored in -filesystem inodes, or the :c:func:`jbd2_journal_init_dev` call can be used -for journal stored on a raw device (in a continuous range of blocks). A -journal_t is a typedef for a struct pointer, so when you are finally -finished make sure you call :c:func:`jbd2_journal_destroy` on it to free up -any used kernel memory. - -Once you have got your journal_t object you need to 'mount' or load the -journal file. The journalling layer expects the space for the journal -was already allocated and initialized properly by the userspace tools. 
-When loading the journal you must call :c:func:`jbd2_journal_load` to process -journal contents. If the client file system detects the journal contents -does not need to be processed (or even need not have valid contents), it -may call :c:func:`jbd2_journal_wipe` to clear the journal contents before -calling :c:func:`jbd2_journal_load`. - -Note that jbd2_journal_wipe(..,0) calls -:c:func:`jbd2_journal_skip_recovery` for you if it detects any outstanding -transactions in the journal and similarly :c:func:`jbd2_journal_load` will -call :c:func:`jbd2_journal_recover` if necessary. I would advise reading -:c:func:`ext4_load_journal` in fs/ext4/super.c for examples on this stage. - -Now you can go ahead and start modifying the underlying filesystem. -Almost. - -You still need to actually journal your filesystem changes, this is done -by wrapping them into transactions. Additionally you also need to wrap -the modification of each of the buffers with calls to the journal layer, -so it knows what the modifications you are actually making are. To do -this use :c:func:`jbd2_journal_start` which returns a transaction handle. - -:c:func:`jbd2_journal_start` and its counterpart :c:func:`jbd2_journal_stop`, -which indicates the end of a transaction are nestable calls, so you can -reenter a transaction if necessary, but remember you must call -:c:func:`jbd2_journal_stop` the same number of times as -:c:func:`jbd2_journal_start` before the transaction is completed (or more -accurately leaves the update phase). Ext4/VFS makes use of this feature to -simplify handling of inode dirtying, quota support, etc. - -Inside each transaction you need to wrap the modifications to the -individual buffers (blocks). Before you start to modify a buffer you -need to call :c:func:`jbd2_journal_get_create_access()` / -:c:func:`jbd2_journal_get_write_access()` / -:c:func:`jbd2_journal_get_undo_access()` as appropriate, this allows the -journalling layer to copy the unmodified -data if it needs to. 
After all the buffer may be part of a previously -uncommitted transaction. At this point you are at last ready to modify a -buffer, and once you are have done so you need to call -:c:func:`jbd2_journal_dirty_metadata`. Or if you've asked for access to a -buffer you now know is now longer required to be pushed back on the -device you can call :c:func:`jbd2_journal_forget` in much the same way as you -might have used :c:func:`bforget` in the past. - -A :c:func:`jbd2_journal_flush` may be called at any time to commit and -checkpoint all your transactions. - -Then at umount time , in your :c:func:`put_super` you can then call -:c:func:`jbd2_journal_destroy` to clean up your in-core journal object. - -Unfortunately there a couple of ways the journal layer can cause a -deadlock. The first thing to note is that each task can only have a -single outstanding transaction at any one time, remember nothing commits -until the outermost :c:func:`jbd2_journal_stop`. This means you must complete -the transaction at the end of each file/inode/address etc. operation you -perform, so that the journalling system isn't re-entered on another -journal. Since transactions can't be nested/batched across differing -journals, and another filesystem other than yours (say ext4) may be -modified in a later syscall. - -The second case to bear in mind is that :c:func:`jbd2_journal_start` can block -if there isn't enough space in the journal for your transaction (based -on the passed nblocks param) - when it blocks it merely(!) needs to wait -for transactions to complete and be committed from other tasks, so -essentially we are waiting for :c:func:`jbd2_journal_stop`. So to avoid -deadlocks you must treat :c:func:`jbd2_journal_start` / -:c:func:`jbd2_journal_stop` as if they were semaphores and include them in -your semaphore ordering rules to prevent -deadlocks. 
Note that :c:func:`jbd2_journal_extend` has similar blocking -behaviour to :c:func:`jbd2_journal_start` so you can deadlock here just as -easily as on :c:func:`jbd2_journal_start`. - -Try to reserve the right number of blocks the first time. ;-). This will -be the maximum number of blocks you are going to touch in this -transaction. I advise having a look at at least ext4_jbd.h to see the -basis on which ext4 uses to make these decisions. - -Another wriggle to watch out for is your on-disk block allocation -strategy. Why? Because, if you do a delete, you need to ensure you -haven't reused any of the freed blocks until the transaction freeing -these blocks commits. If you reused these blocks and crash happens, -there is no way to restore the contents of the reallocated blocks at the -end of the last fully committed transaction. One simple way of doing -this is to mark blocks as free in internal in-memory block allocation -structures only after the transaction freeing them commits. Ext4 uses -journal commit callback for this purpose. - -With journal commit callbacks you can ask the journalling layer to call -a callback function when the transaction is finally committed to disk, -so that you can do some of your own management. You ask the journalling -layer for calling the callback by simply setting -``journal->j_commit_callback`` function pointer and that function is -called after each transaction commit. You can also use -``transaction->t_private_list`` for attaching entries to a transaction -that need processing when the transaction commits. - -JBD2 also provides a way to block all transaction updates via -:c:func:`jbd2_journal_lock_updates()` / -:c:func:`jbd2_journal_unlock_updates()`. Ext4 uses this when it wants a -window with a clean and stable fs for a moment. E.g. - -:: - - - jbd2_journal_lock_updates() //stop new stuff happening.. - jbd2_journal_flush() // checkpoint everything. 
- ..do stuff on stable fs - jbd2_journal_unlock_updates() // carry on with filesystem use. - -The opportunities for abuse and DOS attacks with this should be obvious, -if you allow unprivileged userspace to trigger codepaths containing -these calls. - -Summary -~~~~~~~ - -Using the journal is a matter of wrapping the different context changes, -being each mount, each modification (transaction) and each changed -buffer to tell the journalling layer about them. - -Data Types ----------- - -The journalling layer uses typedefs to 'hide' the concrete definitions -of the structures used. As a client of the JBD2 layer you can just rely -on the using the pointer as a magic cookie of some sort. Obviously the -hiding is not enforced as this is 'C'. - -Structures -~~~~~~~~~~ - -.. kernel-doc:: include/linux/jbd2.h - :internal: - -Functions ---------- - -The functions here are split into two groups those that affect a journal -as a whole, and those which are used to manage transactions - -Journal Level -~~~~~~~~~~~~~ - -.. kernel-doc:: fs/jbd2/journal.c - :export: - -.. kernel-doc:: fs/jbd2/recovery.c - :internal: - -Transasction Level -~~~~~~~~~~~~~~~~~~ - -.. kernel-doc:: fs/jbd2/transaction.c - -See also --------- - -`Journaling the Linux ext2fs Filesystem, LinuxExpo 98, Stephen -Tweedie <http://kernel.org/pub/linux/kernel/people/sct/ext3/journal-design.ps.gz>`__ - -`Ext3 Journalling FileSystem, OLS 2000, Dr. Stephen -Tweedie <http://olstrans.sourceforge.net/release/OLS2000-ext3/OLS2000-ext3.html>`__ - -splice API -========== - -splice is a method for moving blocks of data around inside the kernel, -without continually transferring them between the kernel and user space. - -.. kernel-doc:: fs/splice.c - -pipes API -========= - -Pipe interfaces are all for in-kernel (builtin image) use. They are not -exported for use by modules. - -.. kernel-doc:: include/linux/pipe_fs_i.h - :internal: - -.. 
kernel-doc:: fs/pipe.c - -Encryption API -============== - -A library which filesystems can hook into to support transparent -encryption of files and directories. +Documentation for the support code within the filesystem layer for use in +filesystem implementations. .. toctree:: - :maxdepth: 2 - - fscrypt - -Pathname lookup -=============== - - -This write-up is based on three articles published at lwn.net: + :maxdepth: 2 -- <https://lwn.net/Articles/649115/> Pathname lookup in Linux -- <https://lwn.net/Articles/649729/> RCU-walk: faster pathname lookup in Linux -- <https://lwn.net/Articles/650786/> A walk among the symlinks + journalling + fscrypt -Written by Neil Brown with help from Al Viro and Jon Corbet. -It has subsequently been updated to reflect changes in the kernel -including: +Filesystem-specific documentation +================================= -- per-directory parallel name lookup. +Documentation for individual filesystem types can be found here. .. toctree:: :maxdepth: 2 - path-lookup.rst + binderfs.rst diff --git a/Documentation/filesystems/journalling.rst b/Documentation/filesystems/journalling.rst new file mode 100644 index 000000000000..58ce6b395206 --- /dev/null +++ b/Documentation/filesystems/journalling.rst @@ -0,0 +1,184 @@ +The Linux Journalling API +========================= + +Overview +-------- + +Details +~~~~~~~ + +The journalling layer is easy to use. You need to first of all create a +journal_t data structure. There are two calls to do this dependent on +how you decide to allocate the physical media on which the journal +resides. The :c:func:`jbd2_journal_init_inode` call is for journals stored in +filesystem inodes, or the :c:func:`jbd2_journal_init_dev` call can be used +for journal stored on a raw device (in a continuous range of blocks). A +journal_t is a typedef for a struct pointer, so when you are finally +finished make sure you call :c:func:`jbd2_journal_destroy` on it to free up +any used kernel memory. 
+ +Once you have got your journal_t object you need to 'mount' or load the +journal file. The journalling layer expects the space for the journal +was already allocated and initialized properly by the userspace tools. +When loading the journal you must call :c:func:`jbd2_journal_load` to process +journal contents. If the client file system detects the journal contents +does not need to be processed (or even need not have valid contents), it +may call :c:func:`jbd2_journal_wipe` to clear the journal contents before +calling :c:func:`jbd2_journal_load`. + +Note that jbd2_journal_wipe(..,0) calls +:c:func:`jbd2_journal_skip_recovery` for you if it detects any outstanding +transactions in the journal and similarly :c:func:`jbd2_journal_load` will +call :c:func:`jbd2_journal_recover` if necessary. I would advise reading +:c:func:`ext4_load_journal` in fs/ext4/super.c for examples on this stage. + +Now you can go ahead and start modifying the underlying filesystem. +Almost. + +You still need to actually journal your filesystem changes, this is done +by wrapping them into transactions. Additionally you also need to wrap +the modification of each of the buffers with calls to the journal layer, +so it knows what the modifications you are actually making are. To do +this use :c:func:`jbd2_journal_start` which returns a transaction handle. + +:c:func:`jbd2_journal_start` and its counterpart :c:func:`jbd2_journal_stop`, +which indicates the end of a transaction are nestable calls, so you can +reenter a transaction if necessary, but remember you must call +:c:func:`jbd2_journal_stop` the same number of times as +:c:func:`jbd2_journal_start` before the transaction is completed (or more +accurately leaves the update phase). Ext4/VFS makes use of this feature to +simplify handling of inode dirtying, quota support, etc. + +Inside each transaction you need to wrap the modifications to the +individual buffers (blocks). 
Before you start to modify a buffer you
+need to call :c:func:`jbd2_journal_get_create_access()` /
+:c:func:`jbd2_journal_get_write_access()` /
+:c:func:`jbd2_journal_get_undo_access()` as appropriate; this allows the
+journalling layer to copy the unmodified
+data if it needs to. After all, the buffer may be part of a previously
+uncommitted transaction. At this point you are at last ready to modify a
+buffer, and once you have done so you need to call
+:c:func:`jbd2_journal_dirty_metadata`. Or if you've asked for access to a
+buffer you now know is no longer required to be pushed back on the
+device you can call :c:func:`jbd2_journal_forget` in much the same way as you
+might have used :c:func:`bforget` in the past.
+
+A :c:func:`jbd2_journal_flush` may be called at any time to commit and
+checkpoint all your transactions.
+
+Then at umount time, in your :c:func:`put_super`, you can call
+:c:func:`jbd2_journal_destroy` to clean up your in-core journal object.
+
+Unfortunately there are a couple of ways the journal layer can cause a
+deadlock. The first thing to note is that each task can only have a
+single outstanding transaction at any one time; remember, nothing commits
+until the outermost :c:func:`jbd2_journal_stop`. This means you must complete
+the transaction at the end of each file/inode/address etc. operation you
+perform, so that the journalling system isn't re-entered on another
+journal. This matters because transactions can't be nested/batched across
+differing journals, and a filesystem other than yours (say ext4) may be
+modified in a later syscall.
+
+The second case to bear in mind is that :c:func:`jbd2_journal_start` can block
+if there isn't enough space in the journal for your transaction (based
+on the passed nblocks param) - when it blocks it merely(!) needs to wait
+for transactions to complete and be committed from other tasks, so
+essentially we are waiting for :c:func:`jbd2_journal_stop`.
So to avoid
+deadlocks you must treat :c:func:`jbd2_journal_start` /
+:c:func:`jbd2_journal_stop` as if they were semaphores and include them in
+your semaphore ordering rules to prevent
+deadlocks. Note that :c:func:`jbd2_journal_extend` has similar blocking
+behaviour to :c:func:`jbd2_journal_start` so you can deadlock here just as
+easily as on :c:func:`jbd2_journal_start`.
+
+Try to reserve the right number of blocks the first time. ;-). This will
+be the maximum number of blocks you are going to touch in this
+transaction. I advise having a look at at least ext4_jbd.h to see the
+basis on which ext4 makes these decisions.
+
+Another wriggle to watch out for is your on-disk block allocation
+strategy. Why? Because, if you do a delete, you need to ensure you
+haven't reused any of the freed blocks until the transaction freeing
+these blocks commits. If you reuse these blocks and a crash happens,
+there is no way to restore the contents of the reallocated blocks at the
+end of the last fully committed transaction. One simple way of doing
+this is to mark blocks as free in internal in-memory block allocation
+structures only after the transaction freeing them commits. Ext4 uses
+a journal commit callback for this purpose.
+
+With journal commit callbacks you can ask the journalling layer to call
+a callback function when the transaction is finally committed to disk,
+so that you can do some of your own management. You ask the journalling
+layer to call the callback by simply setting the
+``journal->j_commit_callback`` function pointer; that function is then
+called after each transaction commit. You can also use
+``transaction->t_private_list`` for attaching entries to a transaction
+that need processing when the transaction commits.
+
+JBD2 also provides a way to block all transaction updates via
+:c:func:`jbd2_journal_lock_updates()` /
+:c:func:`jbd2_journal_unlock_updates()`. Ext4 uses this when it wants a
+window with a clean and stable fs for a moment. E.g.
+
+::
+
+
+    jbd2_journal_lock_updates() //stop new stuff happening..
+    jbd2_journal_flush()        // checkpoint everything.
+    ..do stuff on stable fs
+    jbd2_journal_unlock_updates() // carry on with filesystem use.
+
+The opportunities for abuse and DoS attacks with this should be obvious,
+if you allow unprivileged userspace to trigger codepaths containing
+these calls.
+
+Summary
+~~~~~~~
+
+Using the journal is a matter of wrapping the different context changes,
+being each mount, each modification (transaction) and each changed
+buffer to tell the journalling layer about them.
+
+Data Types
+----------
+
+The journalling layer uses typedefs to 'hide' the concrete definitions
+of the structures used. As a client of the JBD2 layer you can just rely
+on using the pointer as a magic cookie of some sort. Obviously the
+hiding is not enforced as this is 'C'.
+
+Structures
+~~~~~~~~~~
+
+.. kernel-doc:: include/linux/jbd2.h
+   :internal:
+
+Functions
+---------
+
+The functions here are split into two groups: those that affect a journal
+as a whole, and those which are used to manage transactions.
+
+Journal Level
+~~~~~~~~~~~~~
+
+.. kernel-doc:: fs/jbd2/journal.c
+   :export:
+
+.. kernel-doc:: fs/jbd2/recovery.c
+   :internal:
+
+Transaction Level
+~~~~~~~~~~~~~~~~~
+
+.. kernel-doc:: fs/jbd2/transaction.c
+
+See also
+--------
+
+`Journaling the Linux ext2fs Filesystem, LinuxExpo 98, Stephen
+Tweedie <http://kernel.org/pub/linux/kernel/people/sct/ext3/journal-design.ps.gz>`__
+
+`Ext3 Journalling FileSystem, OLS 2000, Dr. Stephen
+Tweedie <http://olstrans.sourceforge.net/release/OLS2000-ext3/OLS2000-ext3.html>`__
+
diff --git a/Documentation/filesystems/mount_api.txt b/Documentation/filesystems/mount_api.txt
new file mode 100644
index 000000000000..944d1965e917
--- /dev/null
+++ b/Documentation/filesystems/mount_api.txt
@@ -0,0 +1,709 @@
+			     ====================
+			     FILESYSTEM MOUNT API
+			     ====================
+
+CONTENTS
+
+ (1) Overview.
+
+ (2) The filesystem context.
+
+ (3) The filesystem context operations.
+
+ (4) Filesystem context security.
+
+ (5) VFS filesystem context operations.
+
+ (6) Parameter description.
+
+ (7) Parameter helper functions.
+
+
+========
+OVERVIEW
+========
+
+The creation of new mounts is now to be done in a multistep process:
+
+ (1) Create a filesystem context.
+
+ (2) Parse the parameters and attach them to the context. Parameters are
+     expected to be passed individually from userspace, though legacy binary
+     parameters can also be handled.
+
+ (3) Validate and pre-process the context.
+
+ (4) Get or create a superblock and mountable root.
+
+ (5) Perform the mount.
+
+ (6) Return an error message attached to the context.
+
+ (7) Destroy the context.
+
+To support this, the file_system_type struct gains a new field:
+
+	int (*init_fs_context)(struct fs_context *fc);
+
+which is invoked to set up the filesystem-specific parts of a filesystem
+context, including the additional space.
+
+Note that security initialisation is done *after* the filesystem is called so
+that the namespaces may be adjusted first.
+
+
+======================
+THE FILESYSTEM CONTEXT
+======================
+
+The creation and reconfiguration of a superblock are governed by a filesystem
+context. This is represented by the fs_context structure:
+
+	struct fs_context {
+		const struct fs_context_operations *ops;
+		struct file_system_type *fs_type;
+		void *fs_private;
+		struct dentry *root;
+		struct user_namespace *user_ns;
+		struct net *net_ns;
+		const struct cred *cred;
+		char *source;
+		char *subtype;
+		void *security;
+		void *s_fs_info;
+		unsigned int sb_flags;
+		unsigned int sb_flags_mask;
+		enum fs_context_purpose purpose:8;
+		bool sloppy:1;
+		bool silent:1;
+		...
+	};
+
+The fs_context fields are as follows:
+
+ (*) const struct fs_context_operations *ops
+
+     These are operations that can be done on a filesystem context (see
+     below). This must be set by the ->init_fs_context() file_system_type
+     operation.
+
+ (*) struct file_system_type *fs_type
+
+     A pointer to the file_system_type of the filesystem that is being
+     constructed or reconfigured. This retains a reference on the type owner.
+
+ (*) void *fs_private
+
+     A pointer to the file system's private data. This is where the filesystem
+     will need to store any options it parses.
+
+ (*) struct dentry *root
+
+     A pointer to the root of the mountable tree (and indirectly, the
+     superblock thereof). This is filled in by the ->get_tree() op. If this
+     is set, an active reference on root->d_sb must also be held.
+
+ (*) struct user_namespace *user_ns
+ (*) struct net *net_ns
+
+     These are a subset of the namespaces in use by the invoking process. They
+     retain references on each namespace. The subscribed namespaces may be
+     replaced by the filesystem to reflect other sources, such as the parent
+     mount superblock on an automount.
+
+ (*) const struct cred *cred
+
+     The mounter's credentials. This retains a reference on the credentials.
+
+ (*) char *source
+
+     This specifies the source. It may be a block device (e.g. /dev/sda1) or
+     something more exotic, such as the "host:/path" that NFS desires.
+
+ (*) char *subtype
+
+     This is a string to be added to the type displayed in /proc/mounts to
+     qualify it (used by FUSE). This is available for the filesystem to set if
+     desired.
+
+ (*) void *security
+
+     A place for the LSMs to hang their security data for the superblock. The
+     relevant security operations are described below.
+
+ (*) void *s_fs_info
+
+     The proposed s_fs_info for a new superblock, set in the superblock by
+     sget_fc(). This can be used to distinguish superblocks.
+
+ (*) unsigned int sb_flags
+ (*) unsigned int sb_flags_mask
+
+     Which SB_* flag bits are to be set/cleared in super_block::s_flags.
+
+ (*) enum fs_context_purpose
+
+     This indicates the purpose for which the context is intended.
The
+     available values are:
+
+	FS_CONTEXT_FOR_MOUNT       -- New superblock for explicit mount
+	FS_CONTEXT_FOR_SUBMOUNT    -- New automatic submount of extant mount
+	FS_CONTEXT_FOR_RECONFIGURE -- Change an existing mount
+
+ (*) bool sloppy
+ (*) bool silent
+
+     These are set if the sloppy or silent mount options are given.
+
+     [NOTE] sloppy is probably unnecessary when userspace passes over one
+     option at a time since the error can just be ignored if userspace deems it
+     to be unimportant.
+
+     [NOTE] silent is probably redundant with sb_flags & SB_SILENT.
+
+The mount context is created by calling vfs_new_fs_context() or
+vfs_dup_fs_context() and is destroyed with put_fs_context(). Note that the
+structure is not refcounted.
+
+VFS, security and filesystem mount options are set individually with
+vfs_parse_mount_option(). Options provided by the old mount(2) system call as
+a page of data can be parsed with generic_parse_monolithic().
+
+When mounting, the filesystem is allowed to take data from any of the pointers
+and attach it to the superblock (or whatever), provided it clears the pointer
+in the mount context.
+
+The filesystem is also allowed to allocate resources and pin them with the
+mount context. For instance, NFS might pin the appropriate protocol version
+module.
+
+
+=================================
+THE FILESYSTEM CONTEXT OPERATIONS
+=================================
+
+The filesystem context points to a table of operations:
+
+	struct fs_context_operations {
+		void (*free)(struct fs_context *fc);
+		int (*dup)(struct fs_context *fc, struct fs_context *src_fc);
+		int (*parse_param)(struct fs_context *fc,
+				   struct fs_parameter *param);
+		int (*parse_monolithic)(struct fs_context *fc, void *data);
+		int (*get_tree)(struct fs_context *fc);
+		int (*reconfigure)(struct fs_context *fc);
+	};
+
+These operations are invoked by the various stages of the mount procedure to
+manage the filesystem context.
They are as follows:
+
+ (*) void (*free)(struct fs_context *fc);
+
+     Called to clean up the filesystem-specific part of the filesystem context
+     when the context is destroyed. It should be aware that parts of the
+     context may have been removed and NULL'd out by ->get_tree().
+
+ (*) int (*dup)(struct fs_context *fc, struct fs_context *src_fc);
+
+     Called when a filesystem context has been duplicated to duplicate the
+     filesystem-private data. An error may be returned to indicate failure to
+     do this.
+
+     [!] Note that even if this fails, put_fs_context() will be called
+	 immediately thereafter, so ->dup() *must* make the
+	 filesystem-private data safe for ->free().
+
+ (*) int (*parse_param)(struct fs_context *fc,
+			struct fs_parameter *param);
+
+     Called when a parameter is being added to the filesystem context. param
+     points to the key name and maybe a value object. VFS-specific options
+     will have been weeded out and fc->sb_flags updated in the context.
+     Security options will also have been weeded out and fc->security updated.
+
+     The parameter can be parsed with fs_parse() and fs_lookup_param(). Note
+     that the source(s) are presented as parameters named "source".
+
+     If successful, 0 should be returned or a negative error code otherwise.
+
+ (*) int (*parse_monolithic)(struct fs_context *fc, void *data);
+
+     Called when the mount(2) system call is invoked to pass the entire data
+     page in one go. If this is expected to be just a list of "key[=val]"
+     items separated by commas, then this may be set to NULL.
+
+     The return value is as for ->parse_param().
+
+     If the filesystem (e.g. NFS) needs to examine the data first and then
+     finds it's the standard key-val list then it may pass it off to
+     generic_parse_monolithic().
+
+ (*) int (*get_tree)(struct fs_context *fc);
+
+     Called to get or create the mountable root and superblock, using the
+     information stored in the filesystem context (reconfiguration goes via a
+     different vector).
It may detach any resources it desires from the
+     filesystem context and transfer them to the superblock it creates.
+
+     On success it should set fc->root to the mountable root and return 0. In
+     the case of an error, it should return a negative error code.
+
+     The phase on a userspace-driven context will be set to only allow this to
+     be called once on any particular context.
+
+ (*) int (*reconfigure)(struct fs_context *fc);
+
+     Called to effect reconfiguration of a superblock using information stored
+     in the filesystem context. It may detach any resources it desires from
+     the filesystem context and transfer them to the superblock. The
+     superblock can be found from fc->root->d_sb.
+
+     On success it should return 0. In the case of an error, it should return
+     a negative error code.
+
+     [NOTE] reconfigure is intended as a replacement for remount_fs.
+
+
+===========================
+FILESYSTEM CONTEXT SECURITY
+===========================
+
+The filesystem context contains a security pointer that the LSMs can use for
+building up a security context for the superblock to be mounted. There are a
+number of operations used by the new mount code for this purpose:
+
+ (*) int security_fs_context_alloc(struct fs_context *fc,
+				   struct dentry *reference);
+
+     Called to initialise fc->security (which is preset to NULL) and allocate
+     any resources needed. It should return 0 on success or a negative error
+     code on failure.
+
+     reference will be non-NULL if the context is being created for superblock
+     reconfiguration (FS_CONTEXT_FOR_RECONFIGURE) in which case it indicates
+     the root dentry of the superblock to be reconfigured. It will also be
+     non-NULL in the case of a submount (FS_CONTEXT_FOR_SUBMOUNT) in which case
+     it indicates the automount point.
+
+ (*) int security_fs_context_dup(struct fs_context *fc,
+				 struct fs_context *src_fc);
+
+     Called to initialise fc->security (which is preset to NULL) and allocate
+     any resources needed.
The original filesystem context is pointed to by
+     src_fc and may be used for reference. It should return 0 on success or a
+     negative error code on failure.
+
+ (*) void security_fs_context_free(struct fs_context *fc);
+
+     Called to clean up anything attached to fc->security. Note that the
+     contents may have been transferred to a superblock and the pointer cleared
+     during get_tree.
+
+ (*) int security_fs_context_parse_param(struct fs_context *fc,
+					 struct fs_parameter *param);
+
+     Called for each mount parameter, including the source. The arguments are
+     as for the ->parse_param() method. It should return 0 to indicate that
+     the parameter should be passed on to the filesystem, 1 to indicate that
+     the parameter should be discarded or an error to indicate that the
+     parameter should be rejected.
+
+     The value pointed to by param may be modified (if a string) or stolen
+     (provided the value pointer is NULL'd out). If it is stolen, 1 must be
+     returned to prevent it being passed to the filesystem.
+
+ (*) int security_fs_context_validate(struct fs_context *fc);
+
+     Called after all the options have been parsed to validate the collection
+     as a whole and to do any necessary allocation so that
+     security_sb_get_tree() and security_sb_reconfigure() are less likely to
+     fail. It should return 0 or a negative error code.
+
+     In the case of reconfiguration, the target superblock will be accessible
+     via fc->root.
+
+ (*) int security_sb_get_tree(struct fs_context *fc);
+
+     Called during the mount procedure to verify that the specified superblock
+     is allowed to be mounted and to transfer the security data there. It
+     should return 0 or a negative error code.
+
+ (*) void security_sb_reconfigure(struct fs_context *fc);
+
+     Called to apply any reconfiguration to an LSM's context. It must not
+     fail. Error checking and resource allocation must be done in advance by
+     the parameter parsing and validation hooks.
+
+ (*) int security_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
+				unsigned int mnt_flags);
+
+     Called during the mount procedure to verify that the root dentry attached
+     to the context is permitted to be attached to the specified mountpoint.
+     It should return 0 on success or a negative error code on failure.
+
+
+=================================
+VFS FILESYSTEM CONTEXT OPERATIONS
+=================================
+
+There are two operations for creating a filesystem context and
+one for destroying a context:
+
+ (*) struct fs_context *vfs_new_fs_context(struct file_system_type *fs_type,
+					   struct dentry *reference,
+					   unsigned int sb_flags,
+					   unsigned int sb_flags_mask,
+					   enum fs_context_purpose purpose);
+
+     Create a filesystem context for a given filesystem type and purpose. This
+     allocates the filesystem context, sets the superblock flags, initialises
+     the security and calls fs_type->init_fs_context() to initialise the
+     filesystem private data.
+
+     reference can be NULL or it may indicate the root dentry of a superblock
+     that is going to be reconfigured (FS_CONTEXT_FOR_RECONFIGURE) or
+     the automount point that triggered a submount (FS_CONTEXT_FOR_SUBMOUNT).
+     This is provided as a source of namespace information.
+
+ (*) struct fs_context *vfs_dup_fs_context(struct fs_context *src_fc);
+
+     Duplicate a filesystem context, copying any options noted and duplicating
+     or additionally referencing any resources held therein. This is available
+     for use where a filesystem has to get a mount within a mount, such as NFS4
+     does by internally mounting the root of the target server and then doing a
+     private pathwalk to the target directory.
+
+     The purpose in the new context is inherited from the old one.
+
+ (*) void put_fs_context(struct fs_context *fc);
+
+     Destroy a filesystem context, releasing any resources it holds. This
+     calls the ->free() operation. This is intended to be called by anyone who
+     created a filesystem context.
+
+     [!]
filesystem contexts are not refcounted, so this causes unconditional
+	 destruction.
+
+In all the above operations, apart from the put op, the return is a mount
+context pointer or a negative error code.
+
+For the remaining operations, if an error occurs, a negative error code will be
+returned.
+
+ (*) int vfs_get_tree(struct fs_context *fc);
+
+     Get or create the mountable root and superblock, using the parameters in
+     the filesystem context to select/configure the superblock. This invokes
+     the ->validate() op and then the ->get_tree() op.
+
+     [NOTE] ->validate() could perhaps be rolled into ->get_tree() and
+     ->reconfigure().
+
+ (*) struct vfsmount *vfs_create_mount(struct fs_context *fc);
+
+     Create a mount given the parameters in the specified filesystem context.
+     Note that this does not attach the mount to anything.
+
+ (*) int vfs_parse_fs_param(struct fs_context *fc,
+			    struct fs_parameter *param);
+
+     Supply a single mount parameter to the filesystem context. This includes
+     the specification of the source/device which is specified as the "source"
+     parameter (which may be specified multiple times if the filesystem
+     supports that).
+
+     param specifies the parameter key name and the value. The parameter is
+     first checked to see if it corresponds to a standard mount flag (in which
+     case it is used to set an SB_xxx flag and consumed) or a security option
+     (in which case the LSM consumes it) before it is passed on to the
+     filesystem.
+
+     The parameter value is typed and can be one of:
+
+	fs_value_is_flag,           Parameter not given a value.
+	fs_value_is_string,         Value is a string
+	fs_value_is_blob,           Value is a binary blob
+	fs_value_is_filename,       Value is a filename* + dirfd
+	fs_value_is_filename_empty, Value is a filename* + dirfd + AT_EMPTY_PATH
+	fs_value_is_file,           Value is an open file (file*)
+
+     If there is a value, that value is stored in a union in the struct in one
+     of param->{string,blob,name,file}.
Note that the function may steal and
+     clear the pointer, but then becomes responsible for disposing of the
+     object.
+
+ (*) int vfs_parse_fs_string(struct fs_context *fc, char *key,
+			     const char *value, size_t v_size);
+
+     A wrapper around vfs_parse_fs_param() that just passes a constant string.
+
+ (*) int generic_parse_monolithic(struct fs_context *fc, void *data);
+
+     Parse a sys_mount() data page, assuming the form to be a text list
+     consisting of key[=val] options separated by commas. Each item in the
+     list is passed to vfs_mount_option(). This is the default when the
+     ->parse_monolithic() operation is NULL.
+
+
+=====================
+PARAMETER DESCRIPTION
+=====================
+
+Parameters are described using structures defined in linux/fs_parser.h.
+There's a core description struct that links everything together:
+
+	struct fs_parameter_description {
+		const char name[16];
+		u8 nr_params;
+		u8 nr_alt_keys;
+		u8 nr_enums;
+		bool ignore_unknown;
+		bool no_source;
+		const char *const *keys;
+		const struct constant_table *alt_keys;
+		const struct fs_parameter_spec *specs;
+		const struct fs_parameter_enum *enums;
+	};
+
+For example:
+
+	enum afs_param {
+		Opt_autocell,
+		Opt_bar,
+		Opt_dyn,
+		Opt_foo,
+		Opt_source,
+		nr__afs_params
+	};
+
+	static const struct fs_parameter_description afs_fs_parameters = {
+		.name = "kAFS",
+		.nr_params = nr__afs_params,
+		.nr_alt_keys = ARRAY_SIZE(afs_param_alt_keys),
+		.nr_enums = ARRAY_SIZE(afs_param_enums),
+		.keys = afs_param_keys,
+		.alt_keys = afs_param_alt_keys,
+		.specs = afs_param_specs,
+		.enums = afs_param_enums,
+	};
+
+The members are as follows:
+
+ (1) const char name[16];
+
+     The name to be used in error messages generated by the parse helper
+     functions.
+
+ (2) u8 nr_params;
+
+     The number of discrete parameter identifiers. This indicates the number
+     of elements in the ->specs[] array and also limits the values that may be
+     used in the values that the ->keys[] array maps to.
+
+     It is expected that, for example, two parameters that are related, say
+     "acl" and "noacl" will have the same ID, but will be flagged to indicate
+     that one is the inverse of the other. The value can then be picked out
+     from the parse result.
+
+ (3) const struct fs_parameter_spec *specs;
+
+     Table of parameter specifications, where the entries are of type:
+
+	struct fs_parameter_spec {
+		enum fs_parameter_type type:8;
+		u8 flags;
+	};
+
+     and the parameter identifier is the index to the array. 'type' indicates
+     the desired value type and must be one of:
+
+	TYPE NAME               EXPECTED VALUE          RESULT IN
+	======================= ======================= =====================
+	fs_param_is_flag        No value                n/a
+	fs_param_is_bool        Boolean value           result->boolean
+	fs_param_is_u32         32-bit unsigned int     result->uint_32
+	fs_param_is_u32_octal   32-bit octal int        result->uint_32
+	fs_param_is_u32_hex     32-bit hex int          result->uint_32
+	fs_param_is_s32         32-bit signed int       result->int_32
+	fs_param_is_enum        Enum value name         result->uint_32
+	fs_param_is_string      Arbitrary string        param->string
+	fs_param_is_blob        Binary blob             param->blob
+	fs_param_is_blockdev    Blockdev path           * Needs lookup
+	fs_param_is_path        Path                    * Needs lookup
+	fs_param_is_fd          File descriptor         param->file
+
+     And each parameter can be qualified with 'flags':
+
+	fs_param_v_optional     The value is optional
+	fs_param_neg_with_no    If key name is prefixed with "no", it is false
+	fs_param_neg_with_empty If value is "", it is false
+	fs_param_deprecated     The parameter is deprecated.
+
+     For example:
+
+	static const struct fs_parameter_spec afs_param_specs[nr__afs_params] = {
+		[Opt_autocell] = { fs_param_is_flag },
+		[Opt_bar] = { fs_param_is_enum },
+		[Opt_dyn] = { fs_param_is_flag },
+		[Opt_foo] = { fs_param_is_bool, fs_param_neg_with_no },
+		[Opt_source] = { fs_param_is_string },
+	};
+
+     Note that if the value is of fs_param_is_bool type, fs_parse() will try
+     to match any string value against "0", "1", "no", "yes", "false", "true".
+
+     [!] NOTE that the table must be sorted according to primary key name so
+	 that ->keys[] is also sorted.
+
+ (4) const char *const *keys;
+
+     Table of primary key names for the parameters. There must be one entry
+     per defined parameter. The table is optional if ->nr_params is 0. The
+     table is just an array of names e.g.:
+
+	static const char *const afs_param_keys[nr__afs_params] = {
+		[Opt_autocell] = "autocell",
+		[Opt_bar] = "bar",
+		[Opt_dyn] = "dyn",
+		[Opt_foo] = "foo",
+		[Opt_source] = "source",
+	};
+
+     [!] NOTE that the table must be sorted such that the table can be searched
+	 with bsearch() using strcmp(). This means that the Opt_* values must
+	 correspond to the entries in this table.
+
+ (5) const struct constant_table *alt_keys;
+     u8 nr_alt_keys;
+
+     Table of additional key names and their mappings to parameter ID plus the
+     number of elements in the table. This is optional. The table is just an
+     array of { name, integer } pairs, e.g.:
+
+	static const struct constant_table afs_param_alt_keys[] = {
+		{ "baz", Opt_bar },
+		{ "dynamic", Opt_dyn },
+	};
+
+     [!] NOTE that the table must be sorted such that strcmp() can be used with
+	 bsearch() to search the entries.
+
+     The parameter ID can also be fs_param_key_removed to indicate that a
+     deprecated parameter has been removed and that an error will be given.
+     This differs from fs_param_deprecated where the parameter may still have
+     an effect.
+
+     Further, the behaviour of the parameter may differ when an alternate name
+     is used (for instance with NFS, "v3", "v4.2", etc. are alternate names).
+
+ (6) const struct fs_parameter_enum *enums;
+     u8 nr_enums;
+
+     Table of enum value names to integer mappings and the number of elements
+     stored therein. This is of type:
+
+	struct fs_parameter_enum {
+		u8 param_id;
+		char name[14];
+		u8 value;
+	};
+
+     Where the array is an unsorted list of { parameter ID, name }-keyed
+     elements that indicate the value to map to, e.g.:
+
+	static const struct fs_parameter_enum afs_param_enums[] = {
+		{ Opt_bar, "x", 1},
+		{ Opt_bar, "y", 23},
+		{ Opt_bar, "z", 42},
+	};
+
+     If a parameter of type fs_param_is_enum is encountered, fs_parse() will
+     try to look the value up in the enum table and the result will be stored
+     in the parse result.
+
+ (7) bool no_source;
+
+     If this is set, fs_parse() will ignore any "source" parameter and not
+     pass it to the filesystem.
+
+The parser should be pointed to by the parser pointer in the file_system_type
+struct as this will provide validation on registration (if
+CONFIG_VALIDATE_FS_PARSER=y) and will allow the description to be queried from
+userspace using the fsinfo() syscall.
+
+
+==========================
+PARAMETER HELPER FUNCTIONS
+==========================
+
+A number of helper functions are provided to help a filesystem or an LSM
+process the parameters it is given.
+
+ (*) int lookup_constant(const struct constant_table tbl[],
+			 const char *name, int not_found);
+
+     Look up a constant by name in a table of name -> integer mappings. The
+     table is an array of elements of the following type:
+
+	struct constant_table {
+		const char *name;
+		int value;
+	};
+
+     and it must be sorted such that it can be searched using bsearch() using
+     strcmp(). If a match is found, the corresponding value is returned. If a
+     match isn't found, the not_found value is returned instead.
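Editorial note, not part of the patch: the lookup_constant() behaviour described above can be modelled in userspace with the C library's bsearch(). This is only a sketch of the documented semantics, not the kernel implementation. The name lookup_constant_sketch, the explicit tbl_size argument and the example table are assumptions made for illustration; the documented prototype takes no size.

```c
#include <stdlib.h>
#include <string.h>

/* Same shape as the constant_table element described in the patch. */
struct constant_table {
	const char *name;
	int value;
};

/* bsearch() comparator: the key is a C string, the element a table entry. */
static int cmp_name(const void *key, const void *elem)
{
	const struct constant_table *entry = elem;

	return strcmp(key, entry->name);
}

/*
 * Hypothetical userspace model of the documented behaviour.  tbl_size is
 * an assumption: the documented prototype does not take a size.
 */
static int lookup_constant_sketch(const struct constant_table *tbl,
				  size_t tbl_size, const char *name,
				  int not_found)
{
	const struct constant_table *hit;

	hit = bsearch(name, tbl, tbl_size, sizeof(*tbl), cmp_name);
	return hit ? hit->value : not_found;
}

/* Illustrative table; it must be sorted by name in strcmp() order. */
static const struct constant_table example_tbl[] = {
	{ "baz",     1 },
	{ "dynamic", 2 },
	{ "quux",    3 },
};
```

Because bsearch() assumes sorted input, an unsorted table would make lookups fail silently, which is presumably why the patch insists on the ordering and provides a validation helper.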
+
+ (*) bool validate_constant_table(const struct constant_table *tbl,
+				  size_t tbl_size,
+				  int low, int high, int special);
+
+     Validate a constant table. Checks that all the elements are appropriately
+     ordered, that there are no duplicates and that the values are between low
+     and high inclusive, though provision is made for one allowable special
+     value outside of that range. If no special value is required, special
+     should just be set to lie inside the low-to-high range.
+
+     If all is good, true is returned. If the table is invalid, errors are
+     logged to dmesg, the stack is dumped and false is returned.
+
+ (*) int fs_parse(struct fs_context *fc,
+		  const struct fs_param_parser *parser,
+		  struct fs_parameter *param,
+		  struct fs_param_parse_result *result);
+
+     This is the main interpreter of parameters. It uses the parameter
+     description (parser) to look up the name of the parameter to use and to
+     convert that to a parameter ID (stored in result->key).
+
+     If successful, and if the parameter type indicates the result is a
+     boolean, integer or enum type, the value is converted by this function and
+     the result stored in result->{boolean,int_32,uint_32}.
+
+     If a match isn't initially made, the key is prefixed with "no" and no
+     value is present then an attempt will be made to look up the key with the
+     prefix removed. If this matches a parameter for which the type has flag
+     fs_param_neg_with_no set, then a match will be made and the value will be
+     set to false/0/NULL.
+
+     If the parameter is successfully matched and, optionally, parsed
+     correctly, 1 is returned. If the parameter isn't matched and
+     parser->ignore_unknown is set, then 0 is returned. Otherwise -EINVAL is
+     returned.
+
+ (*) bool fs_validate_description(const struct fs_parameter_description *desc);
+
+     This validates the parameter description. It returns true if the
+     description is good and false if it is not.
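Editorial note, not part of the patch: the checks listed for validate_constant_table() (strict strcmp() ordering, no duplicates, values within low..high or equal to the one special value) can be sketched in userspace as follows. The name validate_table_sketch is hypothetical. Unlike the kernel helper as described, this model logs nothing and dumps no stack; it just returns false.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Same shape as the constant_table element described in the patch. */
struct constant_table {
	const char *name;
	int value;
};

/* Hypothetical userspace model of the documented validation rules. */
static bool validate_table_sketch(const struct constant_table *tbl,
				  size_t tbl_size,
				  int low, int high, int special)
{
	size_t i;

	for (i = 0; i < tbl_size; i++) {
		int v = tbl[i].value;

		/*
		 * Each name must sort strictly after the previous one, which
		 * rejects both misordered entries and duplicates.
		 */
		if (i > 0 && strcmp(tbl[i - 1].name, tbl[i].name) >= 0)
			return false;

		/* Values must lie in [low, high] or equal the special value. */
		if ((v < low || v > high) && v != special)
			return false;
	}
	return true;
}
```

Folding the duplicate check into the ordering check works because two equal names compare as 0 under strcmp(), so a strict "greater than" requirement covers both cases in one pass.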
+
+ (*) int fs_lookup_param(struct fs_context *fc,
+			 struct fs_parameter *value,
+			 bool want_bdev,
+			 struct path *_path);
+
+     This takes a parameter that carries a string or filename type and attempts
+     to do a path lookup on it. If the parameter expects a blockdev, a check
+     is made that the inode actually represents one.
+
+     Returns 0 if successful and *_path will be set; returns a negative error
+     code if not.
diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst
index 9d6b68853f5b..434a07b0002b 100644
--- a/Documentation/filesystems/path-lookup.rst
+++ b/Documentation/filesystems/path-lookup.rst
@@ -1,3 +1,18 @@
+===============
+Pathname lookup
+===============
+
+This write-up is based on three articles published at lwn.net:
+
+- <https://lwn.net/Articles/649115/> Pathname lookup in Linux
+- <https://lwn.net/Articles/649729/> RCU-walk: faster pathname lookup in Linux
+- <https://lwn.net/Articles/650786/> A walk among the symlinks
+
+Written by Neil Brown with help from Al Viro and Jon Corbet.
+It has subsequently been updated to reflect changes in the kernel
+including:
+
+- per-directory parallel name lookup.
 
 Introduction to pathname lookup
 ===============================
@@ -344,7 +359,7 @@ In particular it is held while scanning chains in the dcache hash table, and the
 mount point hash table.
 
 Bringing it together with ``struct nameidata``
---------------------------------------------
+----------------------------------------------
 
 .. _First edition Unix: http://minnie.tuhs.org/cgi-bin/utree.pl?file=V1/u2.s
 
@@ -355,7 +370,7 @@ converts a "name" to an "inode". ``struct nameidata`` contains (among other
 fields):
 
 ``struct path path``
-~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~
 
 A ``path`` contains a ``struct vfsmount`` (which is
 embedded in a ``struct mount``) and a ``struct dentry``. Together these
@@ -366,13 +381,13 @@ step. A reference through ``d_lockref`` and ``mnt_count`` is always
 held.
``struct qstr last`` -~~~~~~~~~~~~~~~~~~ +~~~~~~~~~~~~~~~~~~~~ This is a string together with a length (i.e. _not_ ``nul`` terminated) that is the "next" component in the pathname. ``int last_type`` -~~~~~~~~~~~~~~~ +~~~~~~~~~~~~~~~~~ This is one of ``LAST_NORM``, ``LAST_ROOT``, ``LAST_DOT``, ``LAST_DOTDOT``, or ``LAST_BIND``. The ``last`` field is only valid if the type is @@ -381,7 +396,7 @@ components of the symlink have been processed yet. Others should be fairly self-explanatory. ``struct path root`` -~~~~~~~~~~~~~~~~~~ +~~~~~~~~~~~~~~~~~~~~ This is used to hold a reference to the effective root of the filesystem. Often that reference won't be needed, so this field is @@ -510,7 +525,7 @@ potentially interesting things about these dentries corresponding to three different flags that might be set in ``dentry->d_flags``: ``DCACHE_MANAGE_TRANSIT`` -~~~~~~~~~~~~~~~~~~~~~~~ +~~~~~~~~~~~~~~~~~~~~~~~~~ If this flag has been set, then the filesystem has requested that the ``d_manage()`` dentry operation be called before handling any possible @@ -529,7 +544,7 @@ filesystem, which will then give it a special pass through ``d_manage()`` by returning ``-EISDIR``. ``DCACHE_MOUNTED`` -~~~~~~~~~~~~~~~~ +~~~~~~~~~~~~~~~~~~ This flag is set on every dentry that is mounted on. As Linux supports multiple filesystem namespaces, it is possible that the @@ -542,7 +557,7 @@ If this flag is set, and ``d_manage()`` didn't return ``-EISDIR``, and a new ``dentry`` (both with counted references). ``DCACHE_NEED_AUTOMOUNT`` -~~~~~~~~~~~~~~~~~~~~~~~ +~~~~~~~~~~~~~~~~~~~~~~~~~ If ``d_manage()`` allowed us to get this far, and ``lookup_mnt()`` didn't find a mount point, then this flag causes the ``d_automount()`` dentry @@ -698,7 +713,7 @@ With that little refresher on seqlocks out of the way we can look at the bigger picture of how RCU-walk uses seqlocks. 
``mount_lock`` and ``nd->m_seq`` -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ We already met the ``mount_lock`` seqlock when REF-walk used it to ensure that crossing a mount point is performed safely. RCU-walk uses @@ -727,7 +742,7 @@ results would have been the same. This ensures the invariant holds, at least for vfsmount structures. ``dentry->d_seq`` and ``nd->seq`` -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ In place of taking a count or lock on ``d_reflock``, RCU-walk samples the per-dentry ``d_seq`` seqlock, and stores the sequence number in the @@ -774,7 +789,7 @@ getting a counted reference to the new dentry before dropping that for the old dentry which we saw in REF-walk. No ``inode->i_rwsem`` or even ``rename_lock`` -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ A semaphore is a fairly heavyweight lock that can only be taken when it is permissible to sleep. As ``rcu_read_lock()`` forbids sleeping, @@ -796,7 +811,7 @@ locking. This neatly handles all cases, so adding extra checks on rename_lock would bring no significant value. ``unlazy walk()`` and ``complete_walk()`` -------------------------------------- +----------------------------------------- That "dropping down to REF-walk" typically involves a call to ``unlazy_walk()``, so named because "RCU-walk" is also sometimes diff --git a/Documentation/filesystems/splice.rst b/Documentation/filesystems/splice.rst new file mode 100644 index 000000000000..edd874808472 --- /dev/null +++ b/Documentation/filesystems/splice.rst @@ -0,0 +1,22 @@ +================ +splice and pipes +================ + +splice API +========== + +splice is a method for moving blocks of data around inside the kernel, +without continually transferring them between the kernel and user space. + +.. kernel-doc:: fs/splice.c + +pipes API +========= + +Pipe interfaces are all for in-kernel (builtin image) use. They are not +exported for use by modules. + +.. 
kernel-doc:: include/linux/pipe_fs_i.h + :internal: + +.. kernel-doc:: fs/pipe.c diff --git a/Documentation/filesystems/sysfs.txt b/Documentation/filesystems/sysfs.txt index 41411b0c60a3..5b5311f9358d 100644 --- a/Documentation/filesystems/sysfs.txt +++ b/Documentation/filesystems/sysfs.txt @@ -116,6 +116,27 @@ static struct device_attribute dev_attr_foo = { .store = store_foo, }; +Note: as stated in include/linux/kernel.h, "OTHER_WRITABLE? Generally +considered a bad idea.", so trying to set a sysfs file writable for +everyone will fail, reverting to RO mode for "Others". + +For the common cases sysfs.h provides convenience macros to make +defining attributes easier as well as making code more concise and +readable. The above case could be shortened to: + +static struct device_attribute dev_attr_foo = __ATTR_RW(foo); + +The list of helpers available to define your wrapper function is: +__ATTR_RO(name): assumes default name_show and mode 0444 +__ATTR_WO(name): assumes a name_store only and is restricted to mode + 0200 that is root write access only. +__ATTR_RO_MODE(name, mode): for more restrictive RO access; currently the + only use case is the EFI System Resource Table + (see drivers/firmware/efi/esrt.c) +__ATTR_RW(name): assumes default name_show, name_store and setting + mode to 0644. 
+__ATTR_NULL: which sets the name to NULL and is used as end of list + indicator (see: kernel/workqueue.c) Subsystem-Specific Callbacks ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index 8dc8e9c2913f..761c6fd24a53 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt @@ -857,6 +857,7 @@ struct file_operations { ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); ssize_t (*read_iter) (struct kiocb *, struct iov_iter *); ssize_t (*write_iter) (struct kiocb *, struct iov_iter *); + int (*iopoll)(struct kiocb *kiocb, bool spin); int (*iterate) (struct file *, struct dir_context *); int (*iterate_shared) (struct file *, struct dir_context *); __poll_t (*poll) (struct file *, struct poll_table_struct *); @@ -902,6 +903,8 @@ otherwise noted. write_iter: possibly asynchronous write with iov_iter as source + iopoll: called when aio wants to poll for completions on HIPRI iocbs + iterate: called when the VFS needs to read the directory contents iterate_shared: called when the VFS needs to read the directory contents diff --git a/Documentation/filesystems/xfs.txt b/Documentation/filesystems/xfs.txt index 9ccfd1bc6201..a5cbb5e0e3db 100644 --- a/Documentation/filesystems/xfs.txt +++ b/Documentation/filesystems/xfs.txt @@ -272,7 +272,7 @@ The following sysctls are available for the XFS filesystem: XFS_ERRLEVEL_LOW: 1 XFS_ERRLEVEL_HIGH: 5 - fs.xfs.panic_mask (Min: 0 Default: 0 Max: 255) + fs.xfs.panic_mask (Min: 0 Default: 0 Max: 256) Causes certain error conditions to call BUG(). Value is a bitmask; OR together the tags which represent errors which should cause panics: @@ -285,6 +285,7 @@ The following sysctls are available for the XFS filesystem: XFS_PTAG_SHUTDOWN_IOERROR 0x00000020 XFS_PTAG_SHUTDOWN_LOGERROR 0x00000040 XFS_PTAG_FSBLOCK_ZERO 0x00000080 + XFS_PTAG_VERIFIER_ERROR 0x00000100 This option is intended for debugging only.
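
The xfs.txt hunk describes fs.xfs.panic_mask as a bitmask built by OR-ing tags together. A minimal userspace sketch of that arithmetic follows. The tag values are copied from the hunk; the helper names example_panic_mask() and format_panic_mask() are hypothetical and exist only for this illustration. They are not part of XFS or of this patch.

```c
#include <stdio.h>

/* Tag values as listed in the xfs.txt hunk of this diff. */
#define XFS_PTAG_SHUTDOWN_IOERROR   0x00000020
#define XFS_PTAG_SHUTDOWN_LOGERROR  0x00000040
#define XFS_PTAG_FSBLOCK_ZERO       0x00000080
#define XFS_PTAG_VERIFIER_ERROR     0x00000100

/*
 * Hypothetical helper: OR together the tags whose error conditions
 * should call BUG(). Here: I/O-error and log-error shutdowns.
 */
static unsigned int example_panic_mask(void)
{
	return XFS_PTAG_SHUTDOWN_IOERROR | XFS_PTAG_SHUTDOWN_LOGERROR;
}

/*
 * Hypothetical helper: render the mask as a "name=value" sysctl
 * assignment in decimal. Returns the snprintf() result.
 */
static int format_panic_mask(char *buf, size_t len, unsigned int mask)
{
	return snprintf(buf, len, "fs.xfs.panic_mask=%u", mask);
}
```

The formatted string is the kind of assignment that would typically be handed to sysctl -w or placed in a sysctl configuration file. Any combined value must still fall within the Min/Max range the documentation gives for the running kernel.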