Diffstat (limited to 'Documentation/filesystems')
-rw-r--r--  Documentation/filesystems/api-summary.rst   150
-rw-r--r--  Documentation/filesystems/binderfs.rst       68
-rw-r--r--  Documentation/filesystems/exofs.txt         185
-rw-r--r--  Documentation/filesystems/fscrypt.rst        16
-rw-r--r--  Documentation/filesystems/index.rst         389
-rw-r--r--  Documentation/filesystems/journalling.rst   184
-rw-r--r--  Documentation/filesystems/mount_api.txt     709
-rw-r--r--  Documentation/filesystems/path-lookup.rst    39
-rw-r--r--  Documentation/filesystems/splice.rst         22
-rw-r--r--  Documentation/filesystems/sysfs.txt          21
-rw-r--r--  Documentation/filesystems/vfs.txt             3
-rw-r--r--  Documentation/filesystems/xfs.txt             3
12 files changed, 1223 insertions, 566 deletions
diff --git a/Documentation/filesystems/api-summary.rst b/Documentation/filesystems/api-summary.rst
new file mode 100644
index 000000000000..aa51ffcfa029
--- /dev/null
+++ b/Documentation/filesystems/api-summary.rst
@@ -0,0 +1,150 @@
+=============================
+Linux Filesystems API summary
+=============================
+
+This section contains API-level documentation, mostly taken from the source
+code itself.
+
+The Linux VFS
+=============
+
+The Filesystem types
+--------------------
+
+.. kernel-doc:: include/linux/fs.h
+ :internal:
+
+The Directory Cache
+-------------------
+
+.. kernel-doc:: fs/dcache.c
+ :export:
+
+.. kernel-doc:: include/linux/dcache.h
+ :internal:
+
+Inode Handling
+--------------
+
+.. kernel-doc:: fs/inode.c
+ :export:
+
+.. kernel-doc:: fs/bad_inode.c
+ :export:
+
+Registration and Superblocks
+----------------------------
+
+.. kernel-doc:: fs/super.c
+ :export:
+
+File Locks
+----------
+
+.. kernel-doc:: fs/locks.c
+ :export:
+
+.. kernel-doc:: fs/locks.c
+ :internal:
+
+Other Functions
+---------------
+
+.. kernel-doc:: fs/mpage.c
+ :export:
+
+.. kernel-doc:: fs/namei.c
+ :export:
+
+.. kernel-doc:: fs/buffer.c
+ :export:
+
+.. kernel-doc:: block/bio.c
+ :export:
+
+.. kernel-doc:: fs/seq_file.c
+ :export:
+
+.. kernel-doc:: fs/filesystems.c
+ :export:
+
+.. kernel-doc:: fs/fs-writeback.c
+ :export:
+
+.. kernel-doc:: fs/block_dev.c
+ :export:
+
+.. kernel-doc:: fs/anon_inodes.c
+ :export:
+
+.. kernel-doc:: fs/attr.c
+ :export:
+
+.. kernel-doc:: fs/d_path.c
+ :export:
+
+.. kernel-doc:: fs/dax.c
+ :export:
+
+.. kernel-doc:: fs/direct-io.c
+ :export:
+
+.. kernel-doc:: fs/file_table.c
+ :export:
+
+.. kernel-doc:: fs/libfs.c
+ :export:
+
+.. kernel-doc:: fs/posix_acl.c
+ :export:
+
+.. kernel-doc:: fs/stat.c
+ :export:
+
+.. kernel-doc:: fs/sync.c
+ :export:
+
+.. kernel-doc:: fs/xattr.c
+ :export:
+
+The proc filesystem
+===================
+
+sysctl interface
+----------------
+
+.. kernel-doc:: kernel/sysctl.c
+ :export:
+
+proc filesystem interface
+-------------------------
+
+.. kernel-doc:: fs/proc/base.c
+ :internal:
+
+Events based on file descriptors
+================================
+
+.. kernel-doc:: fs/eventfd.c
+ :export:
+
+The Filesystem for Exporting Kernel Objects
+===========================================
+
+.. kernel-doc:: fs/sysfs/file.c
+ :export:
+
+.. kernel-doc:: fs/sysfs/symlink.c
+ :export:
+
+The debugfs filesystem
+======================
+
+debugfs interface
+-----------------
+
+.. kernel-doc:: fs/debugfs/inode.c
+ :export:
+
+.. kernel-doc:: fs/debugfs/file.c
+ :export:
diff --git a/Documentation/filesystems/binderfs.rst b/Documentation/filesystems/binderfs.rst
new file mode 100644
index 000000000000..c009671f8434
--- /dev/null
+++ b/Documentation/filesystems/binderfs.rst
@@ -0,0 +1,68 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+The Android binderfs Filesystem
+===============================
+
+Android binderfs is a filesystem for the Android binder IPC mechanism. It
+allows binder devices to be added and removed dynamically at runtime. Binder
+devices located in a new binderfs instance are independent of binder devices
+located in other binderfs instances. Mounting a new binderfs instance makes it
+possible to get a set of private binder devices.
+
+Mounting binderfs
+-----------------
+
+Android binderfs can be mounted with::
+
+ mkdir /dev/binderfs
+ mount -t binder binder /dev/binderfs
+
+at which point a new instance of binderfs will show up at ``/dev/binderfs``.
+In a fresh instance of binderfs no binder devices will be present. There will
+only be a ``binder-control`` device which serves as the request handler for
+binderfs. Mounting another binderfs instance at a different location will
+create a new and separate instance from all other binderfs mounts. This is
+identical to the behavior of e.g. ``devpts`` and ``tmpfs``. The Android
+binderfs filesystem can be mounted in user namespaces.
+
+Options
+-------
+max
+ binderfs instances can be mounted with a limit on the number of binder
+ devices that can be allocated. The ``max=<count>`` mount option serves as
+ a per-instance limit. If ``max=<count>`` is set then only ``<count>`` number
+ of binder devices can be allocated in this binderfs instance.
+
+Allocating binder Devices
+-------------------------
+
+.. _ioctl: http://man7.org/linux/man-pages/man2/ioctl.2.html
+
+To allocate a new binder device in a binderfs instance a request needs to be
+sent through the ``binder-control`` device node. A request is sent in the form
+of an `ioctl() <ioctl_>`_.
+
+What a program needs to do is to open the ``binder-control`` device node and
+send a ``BINDER_CTL_ADD`` request to the kernel. Users of binderfs need to
+tell the kernel which name the new binder device should get. By default a name
+can only contain up to ``BINDERFS_MAX_NAME`` chars including the terminating
+zero byte.
+
+Once the request is made via an `ioctl() <ioctl_>`_ that passes a ``struct
+binder_device`` with the name to the kernel, the kernel will allocate a new
+binder device and return the major and minor number of the new device in the
+struct. (This is necessary because binderfs allocates a major device number
+dynamically.) After the `ioctl() <ioctl_>`_ returns there will be a new binder
+device located under ``/dev/binderfs`` with the chosen name.
+
+Deleting binder Devices
+-----------------------
+
+.. _unlink: http://man7.org/linux/man-pages/man2/unlink.2.html
+.. _rm: http://man7.org/linux/man-pages/man1/rm.1.html
+
+Binderfs binder devices can be deleted via `unlink() <unlink_>`_. This means
+that the `rm <rm_>`_ tool can be used to delete them. Note that the
+``binder-control`` device cannot be deleted since this would make the binderfs
+instance unusable. The ``binder-control`` device will be deleted when the
+binderfs instance is unmounted and all references to it have been dropped.
diff --git a/Documentation/filesystems/exofs.txt b/Documentation/filesystems/exofs.txt
deleted file mode 100644
index 23583a136975..000000000000
--- a/Documentation/filesystems/exofs.txt
+++ /dev/null
@@ -1,185 +0,0 @@
-===============================================================================
-WHAT IS EXOFS?
-===============================================================================
-
-exofs is a file system that uses an OSD and exports the API of a normal Linux
-file system. Users access exofs like any other local file system, and exofs
-will in turn issue commands to the local OSD initiator.
-
-OSD is a new T10 command set that views storage devices not as a large/flat
-array of sectors but as a container of objects, each having a length, quota,
-time attributes and more. Each object is addressed by a 64bit ID, and is
-contained in a 64bit ID partition. Each object has associated attributes
-attached to it, which are integral part of the object and provide metadata about
-the object. The standard defines some common obligatory attributes, but user
-attributes can be added as needed.
-
-===============================================================================
-ENVIRONMENT
-===============================================================================
-
-To use this file system, you need to have an object store to run it on. You
-may download a target from:
-http://open-osd.org
-
-See Documentation/scsi/osd.txt for how to setup a working osd environment.
-
-===============================================================================
-USAGE
-===============================================================================
-
-1. Download and compile exofs and open-osd initiator:
- You need an external Kernel source tree or kernel headers from your
- distribution. (anything based on 2.6.26 or later).
-
- a. download open-osd including exofs source using:
- [parent-directory]$ git clone git://git.open-osd.org/open-osd.git
-
- b. Build the library module like this:
- [parent-directory]$ make -C KSRC=$(KER_DIR) open-osd
-
- This will build both the open-osd initiator as well as the exofs kernel
- module. Use whatever parameters you compiled your Kernel with and
- $(KER_DIR) above pointing to the Kernel you compile against. See the file
- open-osd/top-level-Makefile for an example.
-
-2. Get the OSD initiator and target set up properly, and login to the target.
- See Documentation/scsi/osd.txt for farther instructions. Also see ./do-osd
- for example script that does all these steps.
-
-3. Insmod the exofs.ko module:
- [exofs]$ insmod exofs.ko
-
-4. Make sure the directory where you want to mount exists. If not, create it.
- (For example, mkdir /mnt/exofs)
-
-5. At first run you will need to invoke the mkfs.exofs application
-
- As an example, this will create the file system on:
- /dev/osd0 partition ID 65536
-
- mkfs.exofs --pid=65536 --format /dev/osd0
-
- The --format is optional. If not specified, no OSD_FORMAT will be
- performed and a clean file system will be created in the specified pid,
- in the available space of the target. (Use --format=size_in_meg to limit
- the total LUN space available)
-
- If pid already exists, it will be deleted and a new one will be created in
- its place. Be careful.
-
- An exofs lives inside a single OSD partition. You can create multiple exofs
- filesystems on the same device using multiple pids.
-
- (run mkfs.exofs without any parameters for usage help message)
-
-6. Mount the file system.
-
- For example, to mount /dev/osd0, partition ID 0x10000 on /mnt/exofs:
-
- mount -t exofs -o pid=65536 /dev/osd0 /mnt/exofs/
-
-7. For reference (See do-exofs example script):
- do-exofs start - an example of how to perform the above steps.
- do-exofs stop - an example of how to unmount the file system.
- do-exofs format - an example of how to format and mkfs a new exofs.
-
-8. Extra compilation flags (uncomment in fs/exofs/Kbuild):
- CONFIG_EXOFS_DEBUG - for debug messages and extra checks.
-
-===============================================================================
-exofs mount options
-===============================================================================
-Similar to any mount command:
- mount -t exofs -o exofs_options /dev/osdX mount_exofs_directory
-
-Where:
- -t exofs: specifies the exofs file system
-
- /dev/osdX: X is a decimal number. /dev/osdX was created after a successful
- login into an OSD target.
-
- mount_exofs_directory: The directory to mount the file system on
-
- exofs specific options: Options are separated by commas (,)
- pid=<integer> - The partition number to mount/create as
- container of the filesystem.
- This option is mandatory. integer can be
- Hex by pre-pending an 0x to the number.
- osdname=<id> - Mount by a device's osdname.
- osdname is usually a 36 character uuid of the
- form "d2683732-c906-4ee1-9dbd-c10c27bb40df".
- It is one of the device's uuid specified in the
- mkfs.exofs format command.
- If this option is specified then the /dev/osdX
- above can be empty and is ignored.
- to=<integer> - Timeout in ticks for a single command.
- default is (60 * HZ) [for debugging only]
-
-===============================================================================
-DESIGN
-===============================================================================
-
-* The file system control block (AKA on-disk superblock) resides in an object
- with a special ID (defined in common.h).
- Information included in the file system control block is used to fill the
- in-memory superblock structure at mount time. This object is created before
- the file system is used by mkexofs.c. It contains information such as:
- - The file system's magic number
- - The next inode number to be allocated
-
-* Each file resides in its own object and contains the data (and it will be
- possible to extend the file over multiple objects, though this has not been
- implemented yet).
-
-* A directory is treated as a file, and essentially contains a list of <file
- name, inode #> pairs for files that are found in that directory. The object
- IDs correspond to the files' inode numbers and will be allocated according to
- a bitmap (stored in a separate object). Now they are allocated using a
- counter.
-
-* Each file's control block (AKA on-disk inode) is stored in its object's
- attributes. This applies to both regular files and other types (directories,
- device files, symlinks, etc.).
-
-* Credentials are generated per object (inode and superblock) when they are
- created in memory (read from disk or created). The credential works for all
- operations and is used as long as the object remains in memory.
-
-* Async OSD operations are used whenever possible, but the target may execute
- them out of order. The operations that concern us are create, delete,
- readpage, writepage, update_inode, and truncate. The following pairs of
- operations should execute in the order written, and we need to prevent them
- from executing in reverse order:
- - The following are handled with the OBJ_CREATED and OBJ_2BCREATED
- flags. OBJ_CREATED is set when we know the object exists on the OSD -
- in create's callback function, and when we successfully do a
- read_inode.
- OBJ_2BCREATED is set in the beginning of the create function, so we
- know that we should wait.
- - create/delete: delete should wait until the object is created
- on the OSD.
- - create/readpage: readpage should be able to return a page
- full of zeroes in this case. If there was a write already
- en-route (i.e. create, writepage, readpage) then the page
- would be locked, and so it would really be the same as
- create/writepage.
- - create/writepage: if writepage is called for a sync write, it
- should wait until the object is created on the OSD.
- Otherwise, it should just return.
- - create/truncate: truncate should wait until the object is
- created on the OSD.
- - create/update_inode: update_inode should wait until the
- object is created on the OSD.
- - Handled by VFS locks:
- - readpage/delete: shouldn't happen because of page lock.
- - writepage/delete: shouldn't happen because of page lock.
- - readpage/writepage: shouldn't happen because of page lock.
-
-===============================================================================
-LICENSE/COPYRIGHT
-===============================================================================
-The exofs file system is based on ext2 v0.5b (distributed with the Linux kernel
-version 2.6.10). All files include the original copyrights, and the license
-is GPL version 2 (only version 2, as is true for the Linux kernel). The
-Linux kernel can be downloaded from www.kernel.org.
diff --git a/Documentation/filesystems/fscrypt.rst b/Documentation/filesystems/fscrypt.rst
index 3a7b60521b94..08c23b60e016 100644
--- a/Documentation/filesystems/fscrypt.rst
+++ b/Documentation/filesystems/fscrypt.rst
@@ -343,9 +343,9 @@ FS_IOC_SET_ENCRYPTION_POLICY can fail with the following errors:
- ``ENOTEMPTY``: the file is unencrypted and is a nonempty directory
- ``ENOTTY``: this type of filesystem does not implement encryption
- ``EOPNOTSUPP``: the kernel was not configured with encryption
- support for this filesystem, or the filesystem superblock has not
+ support for filesystems, or the filesystem superblock has not
had encryption enabled on it. (For example, to use encryption on an
- ext4 filesystem, CONFIG_EXT4_ENCRYPTION must be enabled in the
+ ext4 filesystem, CONFIG_FS_ENCRYPTION must be enabled in the
kernel config, and the superblock must have had the "encrypt"
feature flag enabled using ``tune2fs -O encrypt`` or ``mkfs.ext4 -O
encrypt``.)
@@ -451,10 +451,18 @@ astute users may notice some differences in behavior:
- Unencrypted files, or files encrypted with a different encryption
policy (i.e. different key, modes, or flags), cannot be renamed or
linked into an encrypted directory; see `Encryption policy
- enforcement`_. Attempts to do so will fail with EPERM. However,
+ enforcement`_. Attempts to do so will fail with EXDEV. However,
encrypted files can be renamed within an encrypted directory, or
into an unencrypted directory.
+ Note: "moving" an unencrypted file into an encrypted directory, e.g.
+ with the `mv` program, is implemented in userspace by a copy
+ followed by a delete. Be aware that the original unencrypted data
+ may remain recoverable from free space on the disk; prefer to keep
+ all files encrypted from the very beginning. The `shred` program
+ may be used to overwrite the source files but isn't guaranteed to be
+ effective on all filesystems and storage devices.
+
- Direct I/O is not supported on encrypted files. Attempts to use
direct I/O on such files will fall back to buffered I/O.
@@ -541,7 +549,7 @@ not be encrypted.
Except for those special files, it is forbidden to have unencrypted
files, or files encrypted with a different encryption policy, in an
encrypted directory tree. Attempts to link or rename such a file into
-an encrypted directory will fail with EPERM. This is also enforced
+an encrypted directory will fail with EXDEV. This is also enforced
during ->lookup() to provide limited protection against offline
attacks that try to disable or downgrade encryption in known locations
where applications may later write sensitive data. It is recommended
diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
index 605befab300b..1131c34d77f6 100644
--- a/Documentation/filesystems/index.rst
+++ b/Documentation/filesystems/index.rst
@@ -1,382 +1,43 @@
-=====================
-Linux Filesystems API
-=====================
+===============================
+Filesystems in the Linux kernel
+===============================
-The Linux VFS
-=============
+This under-development manual will, some glorious day, provide
+comprehensive information on how the Linux virtual filesystem (VFS) layer
+works, along with the filesystems that sit below it. For now, what we have
+can be found below.
-The Filesystem types
---------------------
-
-.. kernel-doc:: include/linux/fs.h
- :internal:
-
-The Directory Cache
--------------------
-
-.. kernel-doc:: fs/dcache.c
- :export:
-
-.. kernel-doc:: include/linux/dcache.h
- :internal:
-
-Inode Handling
---------------
-
-.. kernel-doc:: fs/inode.c
- :export:
-
-.. kernel-doc:: fs/bad_inode.c
- :export:
-
-Registration and Superblocks
-----------------------------
-
-.. kernel-doc:: fs/super.c
- :export:
-
-File Locks
-----------
-
-.. kernel-doc:: fs/locks.c
- :export:
-
-.. kernel-doc:: fs/locks.c
- :internal:
-
-Other Functions
----------------
-
-.. kernel-doc:: fs/mpage.c
- :export:
-
-.. kernel-doc:: fs/namei.c
- :export:
-
-.. kernel-doc:: fs/buffer.c
- :export:
-
-.. kernel-doc:: block/bio.c
- :export:
-
-.. kernel-doc:: fs/seq_file.c
- :export:
-
-.. kernel-doc:: fs/filesystems.c
- :export:
-
-.. kernel-doc:: fs/fs-writeback.c
- :export:
-
-.. kernel-doc:: fs/block_dev.c
- :export:
-
-.. kernel-doc:: fs/anon_inodes.c
- :export:
-
-.. kernel-doc:: fs/attr.c
- :export:
-
-.. kernel-doc:: fs/d_path.c
- :export:
-
-.. kernel-doc:: fs/dax.c
- :export:
-
-.. kernel-doc:: fs/direct-io.c
- :export:
-
-.. kernel-doc:: fs/file_table.c
- :export:
-
-.. kernel-doc:: fs/libfs.c
- :export:
-
-.. kernel-doc:: fs/posix_acl.c
- :export:
-
-.. kernel-doc:: fs/stat.c
- :export:
-
-.. kernel-doc:: fs/sync.c
- :export:
-
-.. kernel-doc:: fs/xattr.c
- :export:
-
-The proc filesystem
-===================
-
-sysctl interface
-----------------
-
-.. kernel-doc:: kernel/sysctl.c
- :export:
-
-proc filesystem interface
--------------------------
-
-.. kernel-doc:: fs/proc/base.c
- :internal:
-
-Events based on file descriptors
-================================
-
-.. kernel-doc:: fs/eventfd.c
- :export:
-
-The Filesystem for Exporting Kernel Objects
-===========================================
-
-.. kernel-doc:: fs/sysfs/file.c
- :export:
-
-.. kernel-doc:: fs/sysfs/symlink.c
- :export:
-
-The debugfs filesystem
+Core VFS documentation
======================
-debugfs interface
------------------
+See these manuals for documentation about the VFS layer itself and how its
+algorithms work.
-.. kernel-doc:: fs/debugfs/inode.c
- :export:
+.. toctree::
+ :maxdepth: 2
-.. kernel-doc:: fs/debugfs/file.c
- :export:
+ path-lookup.rst
+ api-summary
+ splice
-The Linux Journalling API
+Filesystem support layers
=========================
-Overview
---------
-
-Details
-~~~~~~~
-
-The journalling layer is easy to use. You need to first of all create a
-journal_t data structure. There are two calls to do this dependent on
-how you decide to allocate the physical media on which the journal
-resides. The :c:func:`jbd2_journal_init_inode` call is for journals stored in
-filesystem inodes, or the :c:func:`jbd2_journal_init_dev` call can be used
-for journal stored on a raw device (in a continuous range of blocks). A
-journal_t is a typedef for a struct pointer, so when you are finally
-finished make sure you call :c:func:`jbd2_journal_destroy` on it to free up
-any used kernel memory.
-
-Once you have got your journal_t object you need to 'mount' or load the
-journal file. The journalling layer expects the space for the journal
-was already allocated and initialized properly by the userspace tools.
-When loading the journal you must call :c:func:`jbd2_journal_load` to process
-journal contents. If the client file system detects the journal contents
-does not need to be processed (or even need not have valid contents), it
-may call :c:func:`jbd2_journal_wipe` to clear the journal contents before
-calling :c:func:`jbd2_journal_load`.
-
-Note that jbd2_journal_wipe(..,0) calls
-:c:func:`jbd2_journal_skip_recovery` for you if it detects any outstanding
-transactions in the journal and similarly :c:func:`jbd2_journal_load` will
-call :c:func:`jbd2_journal_recover` if necessary. I would advise reading
-:c:func:`ext4_load_journal` in fs/ext4/super.c for examples on this stage.
-
-Now you can go ahead and start modifying the underlying filesystem.
-Almost.
-
-You still need to actually journal your filesystem changes, this is done
-by wrapping them into transactions. Additionally you also need to wrap
-the modification of each of the buffers with calls to the journal layer,
-so it knows what the modifications you are actually making are. To do
-this use :c:func:`jbd2_journal_start` which returns a transaction handle.
-
-:c:func:`jbd2_journal_start` and its counterpart :c:func:`jbd2_journal_stop`,
-which indicates the end of a transaction are nestable calls, so you can
-reenter a transaction if necessary, but remember you must call
-:c:func:`jbd2_journal_stop` the same number of times as
-:c:func:`jbd2_journal_start` before the transaction is completed (or more
-accurately leaves the update phase). Ext4/VFS makes use of this feature to
-simplify handling of inode dirtying, quota support, etc.
-
-Inside each transaction you need to wrap the modifications to the
-individual buffers (blocks). Before you start to modify a buffer you
-need to call :c:func:`jbd2_journal_get_create_access()` /
-:c:func:`jbd2_journal_get_write_access()` /
-:c:func:`jbd2_journal_get_undo_access()` as appropriate, this allows the
-journalling layer to copy the unmodified
-data if it needs to. After all the buffer may be part of a previously
-uncommitted transaction. At this point you are at last ready to modify a
-buffer, and once you are have done so you need to call
-:c:func:`jbd2_journal_dirty_metadata`. Or if you've asked for access to a
-buffer you now know is now longer required to be pushed back on the
-device you can call :c:func:`jbd2_journal_forget` in much the same way as you
-might have used :c:func:`bforget` in the past.
-
-A :c:func:`jbd2_journal_flush` may be called at any time to commit and
-checkpoint all your transactions.
-
-Then at umount time , in your :c:func:`put_super` you can then call
-:c:func:`jbd2_journal_destroy` to clean up your in-core journal object.
-
-Unfortunately there a couple of ways the journal layer can cause a
-deadlock. The first thing to note is that each task can only have a
-single outstanding transaction at any one time, remember nothing commits
-until the outermost :c:func:`jbd2_journal_stop`. This means you must complete
-the transaction at the end of each file/inode/address etc. operation you
-perform, so that the journalling system isn't re-entered on another
-journal. Since transactions can't be nested/batched across differing
-journals, and another filesystem other than yours (say ext4) may be
-modified in a later syscall.
-
-The second case to bear in mind is that :c:func:`jbd2_journal_start` can block
-if there isn't enough space in the journal for your transaction (based
-on the passed nblocks param) - when it blocks it merely(!) needs to wait
-for transactions to complete and be committed from other tasks, so
-essentially we are waiting for :c:func:`jbd2_journal_stop`. So to avoid
-deadlocks you must treat :c:func:`jbd2_journal_start` /
-:c:func:`jbd2_journal_stop` as if they were semaphores and include them in
-your semaphore ordering rules to prevent
-deadlocks. Note that :c:func:`jbd2_journal_extend` has similar blocking
-behaviour to :c:func:`jbd2_journal_start` so you can deadlock here just as
-easily as on :c:func:`jbd2_journal_start`.
-
-Try to reserve the right number of blocks the first time. ;-). This will
-be the maximum number of blocks you are going to touch in this
-transaction. I advise having a look at at least ext4_jbd.h to see the
-basis on which ext4 uses to make these decisions.
-
-Another wriggle to watch out for is your on-disk block allocation
-strategy. Why? Because, if you do a delete, you need to ensure you
-haven't reused any of the freed blocks until the transaction freeing
-these blocks commits. If you reused these blocks and crash happens,
-there is no way to restore the contents of the reallocated blocks at the
-end of the last fully committed transaction. One simple way of doing
-this is to mark blocks as free in internal in-memory block allocation
-structures only after the transaction freeing them commits. Ext4 uses
-journal commit callback for this purpose.
-
-With journal commit callbacks you can ask the journalling layer to call
-a callback function when the transaction is finally committed to disk,
-so that you can do some of your own management. You ask the journalling
-layer for calling the callback by simply setting
-``journal->j_commit_callback`` function pointer and that function is
-called after each transaction commit. You can also use
-``transaction->t_private_list`` for attaching entries to a transaction
-that need processing when the transaction commits.
-
-JBD2 also provides a way to block all transaction updates via
-:c:func:`jbd2_journal_lock_updates()` /
-:c:func:`jbd2_journal_unlock_updates()`. Ext4 uses this when it wants a
-window with a clean and stable fs for a moment. E.g.
-
-::
-
-
- jbd2_journal_lock_updates() //stop new stuff happening..
- jbd2_journal_flush() // checkpoint everything.
- ..do stuff on stable fs
- jbd2_journal_unlock_updates() // carry on with filesystem use.
-
-The opportunities for abuse and DOS attacks with this should be obvious,
-if you allow unprivileged userspace to trigger codepaths containing
-these calls.
-
-Summary
-~~~~~~~
-
-Using the journal is a matter of wrapping the different context changes,
-being each mount, each modification (transaction) and each changed
-buffer to tell the journalling layer about them.
-
-Data Types
-----------
-
-The journalling layer uses typedefs to 'hide' the concrete definitions
-of the structures used. As a client of the JBD2 layer you can just rely
-on the using the pointer as a magic cookie of some sort. Obviously the
-hiding is not enforced as this is 'C'.
-
-Structures
-~~~~~~~~~~
-
-.. kernel-doc:: include/linux/jbd2.h
- :internal:
-
-Functions
----------
-
-The functions here are split into two groups those that affect a journal
-as a whole, and those which are used to manage transactions
-
-Journal Level
-~~~~~~~~~~~~~
-
-.. kernel-doc:: fs/jbd2/journal.c
- :export:
-
-.. kernel-doc:: fs/jbd2/recovery.c
- :internal:
-
-Transasction Level
-~~~~~~~~~~~~~~~~~~
-
-.. kernel-doc:: fs/jbd2/transaction.c
-
-See also
---------
-
-`Journaling the Linux ext2fs Filesystem, LinuxExpo 98, Stephen
-Tweedie <http://kernel.org/pub/linux/kernel/people/sct/ext3/journal-design.ps.gz>`__
-
-`Ext3 Journalling FileSystem, OLS 2000, Dr. Stephen
-Tweedie <http://olstrans.sourceforge.net/release/OLS2000-ext3/OLS2000-ext3.html>`__
-
-splice API
-==========
-
-splice is a method for moving blocks of data around inside the kernel,
-without continually transferring them between the kernel and user space.
-
-.. kernel-doc:: fs/splice.c
-
-pipes API
-=========
-
-Pipe interfaces are all for in-kernel (builtin image) use. They are not
-exported for use by modules.
-
-.. kernel-doc:: include/linux/pipe_fs_i.h
- :internal:
-
-.. kernel-doc:: fs/pipe.c
-
-Encryption API
-==============
-
-A library which filesystems can hook into to support transparent
-encryption of files and directories.
+Documentation for the support code within the filesystem layer for use in
+filesystem implementations.
.. toctree::
- :maxdepth: 2
-
- fscrypt
-
-Pathname lookup
-===============
-
-
-This write-up is based on three articles published at lwn.net:
+ :maxdepth: 2
-- <https://lwn.net/Articles/649115/> Pathname lookup in Linux
-- <https://lwn.net/Articles/649729/> RCU-walk: faster pathname lookup in Linux
-- <https://lwn.net/Articles/650786/> A walk among the symlinks
+ journalling
+ fscrypt
-Written by Neil Brown with help from Al Viro and Jon Corbet.
-It has subsequently been updated to reflect changes in the kernel
-including:
+Filesystem-specific documentation
+=================================
-- per-directory parallel name lookup.
+Documentation for individual filesystem types can be found here.
.. toctree::
:maxdepth: 2
- path-lookup.rst
+ binderfs.rst
diff --git a/Documentation/filesystems/journalling.rst b/Documentation/filesystems/journalling.rst
new file mode 100644
index 000000000000..58ce6b395206
--- /dev/null
+++ b/Documentation/filesystems/journalling.rst
@@ -0,0 +1,184 @@
+The Linux Journalling API
+=========================
+
+Overview
+--------
+
+Details
+~~~~~~~
+
+The journalling layer is easy to use. You need to first of all create a
+journal_t data structure. There are two calls to do this dependent on
+how you decide to allocate the physical media on which the journal
+resides. The :c:func:`jbd2_journal_init_inode` call is for journals stored in
+filesystem inodes, while the :c:func:`jbd2_journal_init_dev` call is for
+journals stored on a raw device (in a contiguous range of blocks). A
+journal_t is a typedef for a struct pointer, so when you are finally
+finished make sure you call :c:func:`jbd2_journal_destroy` on it to free up
+any used kernel memory.
+
+Once you have got your journal_t object you need to 'mount' or load the
+journal file. The journalling layer expects that the space for the journal
+has already been allocated and initialized properly by the userspace tools.
+When loading the journal you must call :c:func:`jbd2_journal_load` to process
+the journal contents. If the client file system detects that the journal
+contents do not need to be processed (or need not even be valid), it
+may call :c:func:`jbd2_journal_wipe` to clear the journal contents before
+calling :c:func:`jbd2_journal_load`.
+
+Note that jbd2_journal_wipe(..,0) calls
+:c:func:`jbd2_journal_skip_recovery` for you if it detects any outstanding
+transactions in the journal and similarly :c:func:`jbd2_journal_load` will
+call :c:func:`jbd2_journal_recover` if necessary. I would advise reading
+:c:func:`ext4_load_journal` in fs/ext4/super.c for examples of this stage.
+
+Now you can go ahead and start modifying the underlying filesystem.
+Almost.
+
+You still need to actually journal your filesystem changes. This is done
+by wrapping them into transactions. Additionally, you need to wrap
+the modification of each of the buffers with calls to the journal layer,
+so that it knows which modifications you are actually making. To do
+this use :c:func:`jbd2_journal_start`, which returns a transaction handle.
+
+:c:func:`jbd2_journal_start` and its counterpart :c:func:`jbd2_journal_stop`,
+which indicates the end of a transaction, are nestable calls, so you can
+reenter a transaction if necessary, but remember you must call
+:c:func:`jbd2_journal_stop` the same number of times as
+:c:func:`jbd2_journal_start` before the transaction is completed (or more
+accurately leaves the update phase). Ext4/VFS makes use of this feature to
+simplify handling of inode dirtying, quota support, etc.
+
+Inside each transaction you need to wrap the modifications to the
+individual buffers (blocks). Before you start to modify a buffer you
+need to call :c:func:`jbd2_journal_get_create_access()` /
+:c:func:`jbd2_journal_get_write_access()` /
+:c:func:`jbd2_journal_get_undo_access()` as appropriate. This allows the
+journalling layer to copy the unmodified
+data if it needs to. After all, the buffer may be part of a previously
+uncommitted transaction. At this point you are at last ready to modify a
+buffer, and once you have done so you need to call
+:c:func:`jbd2_journal_dirty_metadata`. Or if you've asked for access to a
+buffer you now know is no longer required to be pushed back on the
+device you can call :c:func:`jbd2_journal_forget` in much the same way as you
+might have used :c:func:`bforget` in the past.
+
+A :c:func:`jbd2_journal_flush` may be called at any time to commit and
+checkpoint all your transactions.
+
+Then at umount time, in your :c:func:`put_super`, you can call
+:c:func:`jbd2_journal_destroy` to clean up your in-core journal object.
+
+Unfortunately there are a couple of ways the journal layer can cause a
+deadlock. The first thing to note is that each task can only have a
+single outstanding transaction at any one time. Remember that nothing commits
+until the outermost :c:func:`jbd2_journal_stop`. This means you must complete
+the transaction at the end of each file/inode/address etc. operation you
+perform, so that the journalling system isn't re-entered on another
+journal. This matters because transactions can't be nested/batched across
+differing journals, and a filesystem other than yours (say ext4) may be
+modified in a later syscall.
+
+The second case to bear in mind is that :c:func:`jbd2_journal_start` can block
+if there isn't enough space in the journal for your transaction (based
+on the passed nblocks param) - when it blocks it merely(!) needs to wait
+for transactions to complete and be committed from other tasks, so
+essentially we are waiting for :c:func:`jbd2_journal_stop`. So to avoid
+deadlocks you must treat :c:func:`jbd2_journal_start` /
+:c:func:`jbd2_journal_stop` as if they were semaphores and include them in
+your semaphore ordering rules to prevent
+deadlocks. Note that :c:func:`jbd2_journal_extend` has similar blocking
+behaviour to :c:func:`jbd2_journal_start` so you can deadlock here just as
+easily as on :c:func:`jbd2_journal_start`.
+
+Try to reserve the right number of blocks the first time. ;-) This will
+be the maximum number of blocks you are going to touch in this
+transaction. I advise having a look at least at ext4_jbd.h to see the
+basis on which ext4 makes these decisions.
+
+Another wriggle to watch out for is your on-disk block allocation
+strategy. Why? Because, if you do a delete, you need to ensure you
+haven't reused any of the freed blocks until the transaction freeing
+these blocks commits. If you reuse these blocks and a crash happens,
+there is no way to restore the contents of the reallocated blocks at the
+end of the last fully committed transaction. One simple way of doing
+this is to mark blocks as free in internal in-memory block allocation
+structures only after the transaction freeing them commits. Ext4 uses
+a journal commit callback for this purpose.
+
+With journal commit callbacks you can ask the journalling layer to call
+a callback function when the transaction is finally committed to disk,
+so that you can do some of your own management. You ask the journalling
+layer to call the callback by simply setting the
+``journal->j_commit_callback`` function pointer, and that function is
+called after each transaction commit. You can also use
+``transaction->t_private_list`` for attaching entries to a transaction
+that need processing when the transaction commits.
+
+JBD2 also provides a way to block all transaction updates via
+:c:func:`jbd2_journal_lock_updates()` /
+:c:func:`jbd2_journal_unlock_updates()`. Ext4 uses this when it wants a
+window with a clean and stable fs for a moment. E.g.
+
+::
+
+
+ jbd2_journal_lock_updates() //stop new stuff happening..
+ jbd2_journal_flush() // checkpoint everything.
+ ..do stuff on stable fs
+ jbd2_journal_unlock_updates() // carry on with filesystem use.
+
+The opportunities for abuse and DOS attacks with this should be obvious,
+if you allow unprivileged userspace to trigger codepaths containing
+these calls.
+
+Summary
+~~~~~~~
+
+Using the journal is a matter of wrapping each change of context (each
+mount, each modification (transaction) and each changed buffer) in calls
+that tell the journalling layer about it.
+
+Data Types
+----------
+
+The journalling layer uses typedefs to 'hide' the concrete definitions
+of the structures used. As a client of the JBD2 layer you can just rely
+on using the pointer as a magic cookie of some sort. Obviously the
+hiding is not enforced as this is 'C'.
+
+Structures
+~~~~~~~~~~
+
+.. kernel-doc:: include/linux/jbd2.h
+ :internal:
+
+Functions
+---------
+
+The functions here are split into two groups: those that affect a journal
+as a whole, and those which are used to manage transactions.
+
+Journal Level
+~~~~~~~~~~~~~
+
+.. kernel-doc:: fs/jbd2/journal.c
+ :export:
+
+.. kernel-doc:: fs/jbd2/recovery.c
+ :internal:
+
+Transaction Level
+~~~~~~~~~~~~~~~~~
+
+.. kernel-doc:: fs/jbd2/transaction.c
+
+See also
+--------
+
+`Journaling the Linux ext2fs Filesystem, LinuxExpo 98, Stephen
+Tweedie <http://kernel.org/pub/linux/kernel/people/sct/ext3/journal-design.ps.gz>`__
+
+`Ext3 Journalling FileSystem, OLS 2000, Dr. Stephen
+Tweedie <http://olstrans.sourceforge.net/release/OLS2000-ext3/OLS2000-ext3.html>`__
+
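As a rough illustration of the nesting rule described in the journalling
overview above (start and stop calls must balance, and nothing completes
until the outermost stop), here is a minimal userspace sketch. It is a toy
model under stated assumptions, not the jbd2 implementation: toy_task,
toy_handle, toy_start() and toy_stop() are hypothetical names invented for
this example, and no journal I/O or blocking is modelled.

```c
#include <stddef.h>

/* Toy model only: every name here is hypothetical, not part of jbd2. */
struct toy_handle {
	int ref;      /* number of start calls that are still open */
	int nblocks;  /* blocks reserved by the outermost start */
};

struct toy_task {
	struct toy_handle *current_handle; /* at most one open handle per task */
	struct toy_handle storage;         /* keeps the sketch free of malloc() */
	int completed;                     /* updates that have been fully closed */
};

/* Start an update, or re-enter the one this task already has open. */
static struct toy_handle *toy_start(struct toy_task *t, int nblocks)
{
	if (t->current_handle) {
		/* Nested call: share the outer handle and its reservation. */
		t->current_handle->ref++;
		return t->current_handle;
	}
	t->storage.ref = 1;
	t->storage.nblocks = nblocks;
	t->current_handle = &t->storage;
	return t->current_handle;
}

/*
 * Close one nesting level.  Returns 1 when the outermost level was closed,
 * 0 when the update is still open, and -1 for an unbalanced stop.
 */
static int toy_stop(struct toy_task *t)
{
	struct toy_handle *h = t->current_handle;

	if (!h)
		return -1;
	if (--h->ref > 0)
		return 0;
	t->current_handle = NULL;
	t->completed++;
	return 1;
}
```

In this model an inner toy_start() returns the same handle as the outer one
and keeps the outer reservation, which mirrors why the text advises reserving
enough blocks on the first call.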
diff --git a/Documentation/filesystems/mount_api.txt b/Documentation/filesystems/mount_api.txt
new file mode 100644
index 000000000000..944d1965e917
--- /dev/null
+++ b/Documentation/filesystems/mount_api.txt
@@ -0,0 +1,709 @@
+ ====================
+ FILESYSTEM MOUNT API
+ ====================
+
+CONTENTS
+
+ (1) Overview.
+
+ (2) The filesystem context.
+
+ (3) The filesystem context operations.
+
+ (4) Filesystem context security.
+
+ (5) VFS filesystem context operations.
+
+ (6) Parameter description.
+
+ (7) Parameter helper functions.
+
+
+========
+OVERVIEW
+========
+
+The creation of new mounts is now to be done in a multistep process:
+
+ (1) Create a filesystem context.
+
+ (2) Parse the parameters and attach them to the context. Parameters are
+ expected to be passed individually from userspace, though legacy binary
+ parameters can also be handled.
+
+ (3) Validate and pre-process the context.
+
+ (4) Get or create a superblock and mountable root.
+
+ (5) Perform the mount.
+
+ (6) Return an error message attached to the context.
+
+ (7) Destroy the context.
+
+To support this, the file_system_type struct gains a new field:
+
+ int (*init_fs_context)(struct fs_context *fc);
+
+which is invoked to set up the filesystem-specific parts of a filesystem
+context, including the additional space.
+
+Note that security initialisation is done *after* the filesystem is called so
+that the namespaces may be adjusted first.
+
+
+======================
+THE FILESYSTEM CONTEXT
+======================
+
+The creation and reconfiguration of a superblock is governed by a filesystem
+context. This is represented by the fs_context structure:
+
+ struct fs_context {
+ const struct fs_context_operations *ops;
+ struct file_system_type *fs_type;
+ void *fs_private;
+ struct dentry *root;
+ struct user_namespace *user_ns;
+ struct net *net_ns;
+ const struct cred *cred;
+ char *source;
+ char *subtype;
+ void *security;
+ void *s_fs_info;
+ unsigned int sb_flags;
+ unsigned int sb_flags_mask;
+ enum fs_context_purpose purpose:8;
+ bool sloppy:1;
+ bool silent:1;
+ ...
+ };
+
+The fs_context fields are as follows:
+
+ (*) const struct fs_context_operations *ops
+
+ These are operations that can be done on a filesystem context (see
+ below). This must be set by the ->init_fs_context() file_system_type
+ operation.
+
+ (*) struct file_system_type *fs_type
+
+ A pointer to the file_system_type of the filesystem that is being
+ constructed or reconfigured. This retains a reference on the type owner.
+
+ (*) void *fs_private
+
+ A pointer to the file system's private data. This is where the filesystem
+ will need to store any options it parses.
+
+ (*) struct dentry *root
+
+ A pointer to the root of the mountable tree (and indirectly, the
+ superblock thereof). This is filled in by the ->get_tree() op. If this
+ is set, an active reference on root->d_sb must also be held.
+
+ (*) struct user_namespace *user_ns
+ (*) struct net *net_ns
+
+ These are a subset of the namespaces in use by the invoking process. They
+ retain references on each namespace. The subscribed namespaces may be
+ replaced by the filesystem to reflect other sources, such as the parent
+ mount superblock on an automount.
+
+ (*) const struct cred *cred
+
+ The mounter's credentials. This retains a reference on the credentials.
+
+ (*) char *source
+
+ This specifies the source. It may be a block device (e.g. /dev/sda1) or
+ something more exotic, such as the "host:/path" that NFS desires.
+
+ (*) char *subtype
+
+ This is a string to be added to the type displayed in /proc/mounts to
+ qualify it (used by FUSE). This is available for the filesystem to set if
+ desired.
+
+ (*) void *security
+
+ A place for the LSMs to hang their security data for the superblock. The
+ relevant security operations are described below.
+
+ (*) void *s_fs_info
+
+ The proposed s_fs_info for a new superblock, set in the superblock by
+ sget_fc(). This can be used to distinguish superblocks.
+
+ (*) unsigned int sb_flags
+ (*) unsigned int sb_flags_mask
+
+ Which SB_* flag bits are to be set/cleared in super_block::s_flags.
+
+ (*) enum fs_context_purpose
+
+ This indicates the purpose for which the context is intended. The
+ available values are:
+
+ FS_CONTEXT_FOR_MOUNT -- New superblock for explicit mount
+ FS_CONTEXT_FOR_SUBMOUNT -- New automatic submount of extant mount
+ FS_CONTEXT_FOR_RECONFIGURE -- Change an existing mount
+
+ (*) bool sloppy
+ (*) bool silent
+
+ These are set if the sloppy or silent mount options are given.
+
+ [NOTE] sloppy is probably unnecessary when userspace passes over one
+ option at a time since the error can just be ignored if userspace deems it
+ to be unimportant.
+
+ [NOTE] silent is probably redundant with sb_flags & SB_SILENT.
+
+The mount context is created by calling vfs_new_fs_context() or
+vfs_dup_fs_context() and is destroyed with put_fs_context(). Note that the
+structure is not refcounted.
+
+VFS, security and filesystem mount options are set individually with
+vfs_parse_mount_option(). Options provided by the old mount(2) system call as
+a page of data can be parsed with generic_parse_monolithic().
+
+When mounting, the filesystem is allowed to take data from any of the pointers
+and attach it to the superblock (or whatever), provided it clears the pointer
+in the mount context.
+
+The filesystem is also allowed to allocate resources and pin them with the
+mount context. For instance, NFS might pin the appropriate protocol version
+module.
+
+
+=================================
+THE FILESYSTEM CONTEXT OPERATIONS
+=================================
+
+The filesystem context points to a table of operations:
+
+ struct fs_context_operations {
+ void (*free)(struct fs_context *fc);
+ int (*dup)(struct fs_context *fc, struct fs_context *src_fc);
+ int (*parse_param)(struct fs_context *fc,
+ struct fs_parameter *param);
+ int (*parse_monolithic)(struct fs_context *fc, void *data);
+ int (*get_tree)(struct fs_context *fc);
+ int (*reconfigure)(struct fs_context *fc);
+ };
+
+These operations are invoked by the various stages of the mount procedure to
+manage the filesystem context. They are as follows:
+
+ (*) void (*free)(struct fs_context *fc);
+
+ Called to clean up the filesystem-specific part of the filesystem context
+ when the context is destroyed. It should be aware that parts of the
+ context may have been removed and NULL'd out by ->get_tree().
+
+ (*) int (*dup)(struct fs_context *fc, struct fs_context *src_fc);
+
+ Called when a filesystem context has been duplicated to duplicate the
+ filesystem-private data. An error may be returned to indicate failure to
+ do this.
+
+ [!] Note that even if this fails, put_fs_context() will be called
+ immediately thereafter, so ->dup() *must* make the
+ filesystem-private data safe for ->free().
+
+ (*) int (*parse_param)(struct fs_context *fc,
+ struct fs_parameter *param);
+
+ Called when a parameter is being added to the filesystem context. param
+ points to the key name and maybe a value object. VFS-specific options
+ will have been weeded out and fc->sb_flags updated in the context.
+ Security options will also have been weeded out and fc->security updated.
+
+ The parameter can be parsed with fs_parse() and fs_lookup_param(). Note
+ that the source(s) are presented as parameters named "source".
+
+ If successful, 0 should be returned or a negative error code otherwise.
+
+ (*) int (*parse_monolithic)(struct fs_context *fc, void *data);
+
+ Called when the mount(2) system call is invoked to pass the entire data
+ page in one go. If this is expected to be just a list of "key[=val]"
+ items separated by commas, then this may be set to NULL.
+
+ The return value is as for ->parse_param().
+
+ If the filesystem (e.g. NFS) needs to examine the data first and then
+ finds it's the standard key-val list then it may pass it off to
+ generic_parse_monolithic().
+
+ (*) int (*get_tree)(struct fs_context *fc);
+
+ Called to get or create the mountable root and superblock, using the
+ information stored in the filesystem context (reconfiguration goes via a
+ different vector). It may detach any resources it desires from the
+ filesystem context and transfer them to the superblock it creates.
+
+ On success it should set fc->root to the mountable root and return 0. In
+ the case of an error, it should return a negative error code.
+
+ The phase on a userspace-driven context will be set to only allow this to
+ be called once on any particular context.
+
+ (*) int (*reconfigure)(struct fs_context *fc);
+
+ Called to effect reconfiguration of a superblock using information stored
+ in the filesystem context. It may detach any resources it desires from
+ the filesystem context and transfer them to the superblock. The
+ superblock can be found from fc->root->d_sb.
+
+ On success it should return 0. In the case of an error, it should return
+ a negative error code.
+
+ [NOTE] reconfigure is intended as a replacement for remount_fs.
+
+
+===========================
+FILESYSTEM CONTEXT SECURITY
+===========================
+
+The filesystem context contains a security pointer that the LSMs can use for
+building up a security context for the superblock to be mounted. There are a
+number of operations used by the new mount code for this purpose:
+
+ (*) int security_fs_context_alloc(struct fs_context *fc,
+ struct dentry *reference);
+
+ Called to initialise fc->security (which is preset to NULL) and allocate
+ any resources needed. It should return 0 on success or a negative error
+ code on failure.
+
+ reference will be non-NULL if the context is being created for superblock
+ reconfiguration (FS_CONTEXT_FOR_RECONFIGURE) in which case it indicates
+ the root dentry of the superblock to be reconfigured. It will also be
+ non-NULL in the case of a submount (FS_CONTEXT_FOR_SUBMOUNT) in which case
+ it indicates the automount point.
+
+ (*) int security_fs_context_dup(struct fs_context *fc,
+ struct fs_context *src_fc);
+
+ Called to initialise fc->security (which is preset to NULL) and allocate
+ any resources needed. The original filesystem context is pointed to by
+ src_fc and may be used for reference. It should return 0 on success or a
+ negative error code on failure.
+
+ (*) void security_fs_context_free(struct fs_context *fc);
+
+ Called to clean up anything attached to fc->security. Note that the
+ contents may have been transferred to a superblock and the pointer cleared
+ during get_tree.
+
+ (*) int security_fs_context_parse_param(struct fs_context *fc,
+ struct fs_parameter *param);
+
+ Called for each mount parameter, including the source. The arguments are
+ as for the ->parse_param() method. It should return 0 to indicate that
+ the parameter should be passed on to the filesystem, 1 to indicate that
+ the parameter should be discarded or an error to indicate that the
+ parameter should be rejected.
+
+ The value pointed to by param may be modified (if a string) or stolen
+ (provided the value pointer is NULL'd out). If it is stolen, 1 must be
+ returned to prevent it being passed to the filesystem.
+
+ (*) int security_fs_context_validate(struct fs_context *fc);
+
+ Called after all the options have been parsed to validate the collection
+ as a whole and to do any necessary allocation so that
+ security_sb_get_tree() and security_sb_reconfigure() are less likely to
+ fail. It should return 0 or a negative error code.
+
+ In the case of reconfiguration, the target superblock will be accessible
+ via fc->root.
+
+ (*) int security_sb_get_tree(struct fs_context *fc);
+
+ Called during the mount procedure to verify that the specified superblock
+ is allowed to be mounted and to transfer the security data there. It
+ should return 0 or a negative error code.
+
+ (*) void security_sb_reconfigure(struct fs_context *fc);
+
+ Called to apply any reconfiguration to an LSM's context. It must not
+ fail. Error checking and resource allocation must be done in advance by
+ the parameter parsing and validation hooks.
+
+ (*) int security_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
+ unsigned int mnt_flags);
+
+ Called during the mount procedure to verify that the root dentry attached
+ to the context is permitted to be attached to the specified mountpoint.
+ It should return 0 on success or a negative error code on failure.
+
+
+=================================
+VFS FILESYSTEM CONTEXT OPERATIONS
+=================================
+
+There are two operations for creating a filesystem context and
+one for destroying a context:
+
+ (*) struct fs_context *vfs_new_fs_context(struct file_system_type *fs_type,
+ struct dentry *reference,
+ unsigned int sb_flags,
+ unsigned int sb_flags_mask,
+ enum fs_context_purpose purpose);
+
+ Create a filesystem context for a given filesystem type and purpose. This
+ allocates the filesystem context, sets the superblock flags, initialises
+ the security and calls fs_type->init_fs_context() to initialise the
+ filesystem private data.
+
+ reference can be NULL or it may indicate the root dentry of a superblock
+ that is going to be reconfigured (FS_CONTEXT_FOR_RECONFIGURE) or
+ the automount point that triggered a submount (FS_CONTEXT_FOR_SUBMOUNT).
+ This is provided as a source of namespace information.
+
+ (*) struct fs_context *vfs_dup_fs_context(struct fs_context *src_fc);
+
+ Duplicate a filesystem context, copying any options noted and duplicating
+ or additionally referencing any resources held therein. This is available
+ for use where a filesystem has to get a mount within a mount, such as NFS4
+ does by internally mounting the root of the target server and then doing a
+ private pathwalk to the target directory.
+
+ The purpose in the new context is inherited from the old one.
+
+ (*) void put_fs_context(struct fs_context *fc);
+
+ Destroy a filesystem context, releasing any resources it holds. This
+ calls the ->free() operation. This is intended to be called by anyone who
+ created a filesystem context.
+
+ [!] filesystem contexts are not refcounted, so this causes unconditional
+ destruction.
+
+In all the above operations, apart from the put op, the return is a mount
+context pointer or a negative error code.
+
+For the remaining operations, if an error occurs, a negative error code will be
+returned.
+
+ (*) int vfs_get_tree(struct fs_context *fc);
+
+ Get or create the mountable root and superblock, using the parameters in
+ the filesystem context to select/configure the superblock. This invokes
+ the ->validate() op and then the ->get_tree() op.
+
+ [NOTE] ->validate() could perhaps be rolled into ->get_tree() and
+ ->reconfigure().
+
+ (*) struct vfsmount *vfs_create_mount(struct fs_context *fc);
+
+ Create a mount given the parameters in the specified filesystem context.
+ Note that this does not attach the mount to anything.
+
+ (*) int vfs_parse_fs_param(struct fs_context *fc,
+ struct fs_parameter *param);
+
+ Supply a single mount parameter to the filesystem context. This includes
+ the specification of the source/device which is specified as the "source"
+ parameter (which may be specified multiple times if the filesystem
+ supports that).
+
+ param specifies the parameter key name and the value. The parameter is
+ first checked to see if it corresponds to a standard mount flag (in which
+ case it is used to set an SB_xxx flag and consumed) or a security option
+ (in which case the LSM consumes it) before it is passed on to the
+ filesystem.
+
+ The parameter value is typed and can be one of:
+
+ fs_value_is_flag, Parameter not given a value.
+ fs_value_is_string, Value is a string
+ fs_value_is_blob, Value is a binary blob
+ fs_value_is_filename, Value is a filename* + dirfd
+ fs_value_is_filename_empty, Value is a filename* + dirfd + AT_EMPTY_PATH
+ fs_value_is_file, Value is an open file (file*)
+
+ If there is a value, that value is stored in a union in the struct in one
+ of param->{string,blob,name,file}. Note that the function may steal and
+ clear the pointer, but then becomes responsible for disposing of the
+ object.
+
+ (*) int vfs_parse_fs_string(struct fs_context *fc, char *key,
+ const char *value, size_t v_size);
+
+ A wrapper around vfs_parse_fs_param() that just passes a constant string.
+
+ (*) int generic_parse_monolithic(struct fs_context *fc, void *data);
+
+ Parse a sys_mount() data page, assuming the form to be a text list
+ consisting of key[=val] options separated by commas. Each item in the
+ list is passed to vfs_mount_option(). This is the default when the
+ ->parse_monolithic() operation is NULL.
+
+
+=====================
+PARAMETER DESCRIPTION
+=====================
+
+Parameters are described using structures defined in linux/fs_parser.h.
+There's a core description struct that links everything together:
+
+ struct fs_parameter_description {
+ const char name[16];
+ u8 nr_params;
+ u8 nr_alt_keys;
+ u8 nr_enums;
+ bool ignore_unknown;
+ bool no_source;
+ const char *const *keys;
+ const struct constant_table *alt_keys;
+ const struct fs_parameter_spec *specs;
+ const struct fs_parameter_enum *enums;
+ };
+
+For example:
+
+ enum afs_param {
+ Opt_autocell,
+ Opt_bar,
+ Opt_dyn,
+ Opt_foo,
+ Opt_source,
+ nr__afs_params
+ };
+
+ static const struct fs_parameter_description afs_fs_parameters = {
+ .name = "kAFS",
+ .nr_params = nr__afs_params,
+ .nr_alt_keys = ARRAY_SIZE(afs_param_alt_keys),
+ .nr_enums = ARRAY_SIZE(afs_param_enums),
+ .keys = afs_param_keys,
+ .alt_keys = afs_param_alt_keys,
+ .specs = afs_param_specs,
+ .enums = afs_param_enums,
+ };
+
+The members are as follows:
+
+ (1) const char name[16];
+
+ The name to be used in error messages generated by the parse helper
+ functions.
+
+ (2) u8 nr_params;
+
+ The number of discrete parameter identifiers. This indicates the number
+ of elements in the ->specs[] array and also limits the values that may be
+ used in the values that the ->keys[] array maps to.
+
+ It is expected that, for example, two parameters that are related, say
+ "acl" and "noacl" will have the same ID, but will be flagged to indicate
+ that one is the inverse of the other. The value can then be picked out
+ from the parse result.
+
+ (3) const struct fs_parameter_spec *specs;
+
+ Table of parameter specifications, where the entries are of type:
+
+ struct fs_parameter_spec {
+ enum fs_parameter_type type:8;
+ u8 flags;
+ };
+
+ and the parameter identifier is the index to the array. 'type' indicates
+ the desired value type and must be one of:
+
+ TYPE NAME EXPECTED VALUE RESULT IN
+ ======================= ======================= =====================
+ fs_param_is_flag No value n/a
+ fs_param_is_bool Boolean value result->boolean
+ fs_param_is_u32 32-bit unsigned int result->uint_32
+ fs_param_is_u32_octal 32-bit octal int result->uint_32
+ fs_param_is_u32_hex 32-bit hex int result->uint_32
+ fs_param_is_s32 32-bit signed int result->int_32
+ fs_param_is_enum Enum value name result->uint_32
+ fs_param_is_string Arbitrary string param->string
+ fs_param_is_blob Binary blob param->blob
+ fs_param_is_blockdev Blockdev path * Needs lookup
+ fs_param_is_path Path * Needs lookup
+ fs_param_is_fd File descriptor param->file
+
+ And each parameter can be qualified with 'flags':
+
+ fs_param_v_optional The value is optional
+ fs_param_neg_with_no If key name is prefixed with "no", it is false
+ fs_param_neg_with_empty If value is "", it is false
+ fs_param_deprecated The parameter is deprecated.
+
+ For example:
+
+ static const struct fs_parameter_spec afs_param_specs[nr__afs_params] = {
+ [Opt_autocell] = { fs_param_is_flag },
+ [Opt_bar] = { fs_param_is_enum },
+ [Opt_dyn] = { fs_param_is_flag },
+ [Opt_foo] = { fs_param_is_bool, fs_param_neg_with_no },
+ [Opt_source] = { fs_param_is_string },
+ };
+
+ Note that if the value is of fs_param_is_bool type, fs_parse() will try
+ to match any string value against "0", "1", "no", "yes", "false", "true".
+
+ [!] NOTE that the table must be sorted according to primary key name so
+ that ->keys[] is also sorted.
+
+ (4) const char *const *keys;
+
+ Table of primary key names for the parameters. There must be one entry
+ per defined parameter. The table is optional if ->nr_params is 0. The
+ table is just an array of names e.g.:
+
+ static const char *const afs_param_keys[nr__afs_params] = {
+ [Opt_autocell] = "autocell",
+ [Opt_bar] = "bar",
+ [Opt_dyn] = "dyn",
+ [Opt_foo] = "foo",
+ [Opt_source] = "source",
+ };
+
+ [!] NOTE that the table must be sorted such that the table can be searched
+ with bsearch() using strcmp(). This means that the Opt_* values must
+ correspond to the entries in this table.
+
+ (5) const struct constant_table *alt_keys;
+ u8 nr_alt_keys;
+
+ Table of additional key names and their mappings to parameter ID plus the
+ number of elements in the table. This is optional. The table is just an
+ array of { name, integer } pairs, e.g.:
+
+ static const struct constant_table afs_param_alt_keys[] = {
+ { "baz", Opt_bar },
+ { "dynamic", Opt_dyn },
+ };
+
+ [!] NOTE that the table must be sorted such that strcmp() can be used with
+ bsearch() to search the entries.
+
+ The parameter ID can also be fs_param_key_removed to indicate that a
+ deprecated parameter has been removed and that an error will be given.
+ This differs from fs_param_deprecated where the parameter may still have
+ an effect.
+
+ Further, the behaviour of the parameter may differ when an alternate name
+ is used (for instance with NFS, "v3", "v4.2", etc. are alternate names).
+
+ (6) const struct fs_parameter_enum *enums;
+ u8 nr_enums;
+
+ Table of enum value names to integer mappings and the number of elements
+ stored therein. This is of type:
+
+ struct fs_parameter_enum {
+ u8 param_id;
+ char name[14];
+ u8 value;
+ };
+
+ Where the array is an unsorted list of { parameter ID, name }-keyed
+ elements that indicate the value to map to, e.g.:
+
+ static const struct fs_parameter_enum afs_param_enums[] = {
+ { Opt_bar, "x", 1},
+ { Opt_bar, "y", 23},
+ { Opt_bar, "z", 42},
+ };
+
+ If a parameter of type fs_param_is_enum is encountered, fs_parse() will
+ try to look the value up in the enum table and the result will be stored
+ in the parse result.
+
+ (7) bool no_source;
+
+ If this is set, fs_parse() will ignore any "source" parameter and not
+ pass it to the filesystem.
+
+The parser should be pointed to by the parser pointer in the file_system_type
+struct as this will provide validation on registration (if
+CONFIG_VALIDATE_FS_PARSER=y) and will allow the description to be queried from
+userspace using the fsinfo() syscall.
+
+
+==========================
+PARAMETER HELPER FUNCTIONS
+==========================
+
+A number of helper functions are provided to help a filesystem or an LSM
+process the parameters it is given.
+
+ (*) int lookup_constant(const struct constant_table tbl[],
+ const char *name, int not_found);
+
+ Look up a constant by name in a table of name -> integer mappings. The
+ table is an array of elements of the following type:
+
+ struct constant_table {
+ const char *name;
+ int value;
+ };
+
+ and it must be sorted such that it can be searched using bsearch() using
+ strcmp(). If a match is found, the corresponding value is returned. If a
+ match isn't found, the not_found value is returned instead.
+
+ (*) bool validate_constant_table(const struct constant_table *tbl,
+ size_t tbl_size,
+ int low, int high, int special);
+
+ Validate a constant table. Checks that all the elements are appropriately
+ ordered, that there are no duplicates and that the values are between low
+ and high inclusive, though provision is made for one allowable special
+ value outside of that range. If no special value is required, special
+ should just be set to lie inside the low-to-high range.
+
+ If all is good, true is returned. If the table is invalid, errors are
+ logged to dmesg, the stack is dumped and false is returned.
+
+ (*) int fs_parse(struct fs_context *fc,
+ const struct fs_param_parser *parser,
+ struct fs_parameter *param,
+ struct fs_param_parse_result *result);
+
+ This is the main interpreter of parameters. It uses the parameter
+ description (parser) to look up the name of the parameter to use and to
+ convert that to a parameter ID (stored in result->key).
+
+ If successful, and if the parameter type indicates the result is a
+ boolean, integer or enum type, the value is converted by this function and
+ the result stored in result->{boolean,int_32,uint_32}.
+
+ If a match isn't initially made, the key is prefixed with "no" and no
+ value is present, then an attempt will be made to look up the key with the
+ prefix removed. If this matches a parameter for which the type has flag
+ fs_param_neg_with_no set, then a match will be made and the value will be
+ set to false/0/NULL.
+
+ If the parameter is successfully matched and, optionally, parsed
+ correctly, 1 is returned. If the parameter isn't matched and
+ parser->ignore_unknown is set, then 0 is returned. Otherwise -EINVAL is
+ returned.
+
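The "no"-prefix fallback described above can be illustrated with a small userspace sketch. Everything here is hypothetical (`struct param_spec_sketch`, `find_spec()`, `match_key_sketch()`); it models only the matching rule, not the value conversion or the return codes of fs_parse().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified parameter spec. */
struct param_spec_sketch {
	const char *name;
	bool neg_with_no;	/* stands in for fs_param_neg_with_no */
};

/* Linear search by exact key name; returns the index or -1. */
static int find_spec(const struct param_spec_sketch *specs, size_t n,
		     const char *key)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (strcmp(specs[i].name, key) == 0)
			return (int)i;
	return -1;
}

/*
 * Try the key as given; if that fails, the key starts with "no" and no
 * value was supplied, retry with the prefix removed. The retry only
 * counts if the matched spec allows negation.
 */
static int match_key_sketch(const struct param_spec_sketch *specs, size_t n,
			    const char *key, bool has_value, bool *negated)
{
	int idx;

	assert(specs != NULL && key != NULL && negated != NULL);
	*negated = false;

	idx = find_spec(specs, n, key);
	if (idx >= 0)
		return idx;

	if (has_value || strncmp(key, "no", 2) != 0 || key[2] == '\0')
		return -1;

	idx = find_spec(specs, n, key + 2);
	if (idx < 0 || !specs[idx].neg_with_no)
		return -1;

	*negated = true;	/* caller would set the value to false/0/NULL */
	return idx;
}
```

With a spec table containing a negatable "acl" entry, the key "noacl" without a value would match that entry with `*negated` set, while "noacl=1" would not match at all.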
+ (*) bool fs_validate_description(const struct fs_parameter_description *desc);
+
+ This validates the parameter description. It returns true if the
+ description is good and false if it is not.
+
+ (*) int fs_lookup_param(struct fs_context *fc,
+ struct fs_parameter *value,
+ bool want_bdev,
+ struct path *_path);
+
+ This takes a parameter that carries a string or filename type and attempts
+ to do a path lookup on it. If the parameter expects a blockdev, a check
+ is made that the inode actually represents one.
+
+ Returns 0 if successful and *_path will be set; returns a negative error
+ code if not.
diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst
index 9d6b68853f5b..434a07b0002b 100644
--- a/Documentation/filesystems/path-lookup.rst
+++ b/Documentation/filesystems/path-lookup.rst
@@ -1,3 +1,18 @@
+===============
+Pathname lookup
+===============
+
+This write-up is based on three articles published at lwn.net:
+
+- <https://lwn.net/Articles/649115/> Pathname lookup in Linux
+- <https://lwn.net/Articles/649729/> RCU-walk: faster pathname lookup in Linux
+- <https://lwn.net/Articles/650786/> A walk among the symlinks
+
+Written by Neil Brown with help from Al Viro and Jon Corbet.
+It has subsequently been updated to reflect changes in the kernel
+including:
+
+- per-directory parallel name lookup.
Introduction to pathname lookup
===============================
@@ -344,7 +359,7 @@ In particular it is held while scanning chains in the dcache hash
table, and the mount point hash table.
Bringing it together with ``struct nameidata``
---------------------------------------------
+----------------------------------------------
.. _First edition Unix: http://minnie.tuhs.org/cgi-bin/utree.pl?file=V1/u2.s
@@ -355,7 +370,7 @@ converts a "name" to an "inode". ``struct nameidata`` contains (among
other fields):
``struct path path``
-~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~
A ``path`` contains a ``struct vfsmount`` (which is
embedded in a ``struct mount``) and a ``struct dentry``. Together these
@@ -366,13 +381,13 @@ step. A reference through ``d_lockref`` and ``mnt_count`` is always
held.
``struct qstr last``
-~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~
This is a string together with a length (i.e. _not_ ``nul`` terminated)
that is the "next" component in the pathname.
``int last_type``
-~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~
This is one of ``LAST_NORM``, ``LAST_ROOT``, ``LAST_DOT``, ``LAST_DOTDOT``, or
``LAST_BIND``. The ``last`` field is only valid if the type is
@@ -381,7 +396,7 @@ components of the symlink have been processed yet. Others should be
fairly self-explanatory.
``struct path root``
-~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~
This is used to hold a reference to the effective root of the
filesystem. Often that reference won't be needed, so this field is
@@ -510,7 +525,7 @@ potentially interesting things about these dentries corresponding
to three different flags that might be set in ``dentry->d_flags``:
``DCACHE_MANAGE_TRANSIT``
-~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~
If this flag has been set, then the filesystem has requested that the
``d_manage()`` dentry operation be called before handling any possible
@@ -529,7 +544,7 @@ filesystem, which will then give it a special pass through
``d_manage()`` by returning ``-EISDIR``.
``DCACHE_MOUNTED``
-~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~
This flag is set on every dentry that is mounted on. As Linux
supports multiple filesystem namespaces, it is possible that the
@@ -542,7 +557,7 @@ If this flag is set, and ``d_manage()`` didn't return ``-EISDIR``,
and a new ``dentry`` (both with counted references).
``DCACHE_NEED_AUTOMOUNT``
-~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~
If ``d_manage()`` allowed us to get this far, and ``lookup_mnt()`` didn't
find a mount point, then this flag causes the ``d_automount()`` dentry
@@ -698,7 +713,7 @@ With that little refresher on seqlocks out of the way we can look at
the bigger picture of how RCU-walk uses seqlocks.
``mount_lock`` and ``nd->m_seq``
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We already met the ``mount_lock`` seqlock when REF-walk used it to
ensure that crossing a mount point is performed safely. RCU-walk uses
@@ -727,7 +742,7 @@ results would have been the same. This ensures the invariant holds,
at least for vfsmount structures.
``dentry->d_seq`` and ``nd->seq``
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In place of taking a count or lock on ``d_lockref``, RCU-walk samples
the per-dentry ``d_seq`` seqlock, and stores the sequence number in the
@@ -774,7 +789,7 @@ getting a counted reference to the new dentry before dropping that for
the old dentry which we saw in REF-walk.
No ``inode->i_rwsem`` or even ``rename_lock``
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A semaphore is a fairly heavyweight lock that can only be taken when it is
permissible to sleep. As ``rcu_read_lock()`` forbids sleeping,
@@ -796,7 +811,7 @@ locking. This neatly handles all cases, so adding extra checks on
rename_lock would bring no significant value.
``unlazy_walk()`` and ``complete_walk()``
--------------------------------------
+-----------------------------------------
That "dropping down to REF-walk" typically involves a call to
``unlazy_walk()``, so named because "RCU-walk" is also sometimes
diff --git a/Documentation/filesystems/splice.rst b/Documentation/filesystems/splice.rst
new file mode 100644
index 000000000000..edd874808472
--- /dev/null
+++ b/Documentation/filesystems/splice.rst
@@ -0,0 +1,22 @@
+================
+splice and pipes
+================
+
+splice API
+==========
+
+splice is a method for moving blocks of data around inside the kernel,
+without continually transferring them between the kernel and user space.
+
+.. kernel-doc:: fs/splice.c
+
+pipes API
+=========
+
+Pipe interfaces are all for in-kernel (builtin image) use. They are not
+exported for use by modules.
+
+.. kernel-doc:: include/linux/pipe_fs_i.h
+ :internal:
+
+.. kernel-doc:: fs/pipe.c
diff --git a/Documentation/filesystems/sysfs.txt b/Documentation/filesystems/sysfs.txt
index 41411b0c60a3..5b5311f9358d 100644
--- a/Documentation/filesystems/sysfs.txt
+++ b/Documentation/filesystems/sysfs.txt
@@ -116,6 +116,27 @@ static struct device_attribute dev_attr_foo = {
.store = store_foo,
};
+Note that, as stated in include/linux/kernel.h, "OTHER_WRITABLE? Generally
+considered a bad idea.", so trying to make a sysfs file writable by
+everyone will fail and the file will revert to RO mode for "Others".
+
+For the common cases sysfs.h provides convenience macros to make
+defining attributes easier as well as making code more concise and
+readable. The above case could be shortened to:
+
+static struct device_attribute dev_attr_foo = __ATTR_RW(foo);
+
+The list of helpers available to define your wrapper function is:
+__ATTR_RO(name): assumes default name_show and mode 0444
+__ATTR_WO(name): assumes a name_store only and is restricted to mode
+ 0200 that is root write access only.
+__ATTR_RO_MODE(name, mode): for more restrictive RO access; currently the
+ only use case is the EFI System Resource Table
+ (see drivers/firmware/efi/esrt.c)
+__ATTR_RW(name): assumes default name_show, name_store and setting
+ mode to 0644.
+__ATTR_NULL: which sets the name to NULL and is used as end of list
+ indicator (see: kernel/workqueue.c)
Subsystem-Specific Callbacks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 8dc8e9c2913f..761c6fd24a53 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -857,6 +857,7 @@ struct file_operations {
ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
+ int (*iopoll)(struct kiocb *kiocb, bool spin);
int (*iterate) (struct file *, struct dir_context *);
int (*iterate_shared) (struct file *, struct dir_context *);
__poll_t (*poll) (struct file *, struct poll_table_struct *);
@@ -902,6 +903,8 @@ otherwise noted.
write_iter: possibly asynchronous write with iov_iter as source
+ iopoll: called when aio wants to poll for completions on HIPRI iocbs
+
iterate: called when the VFS needs to read the directory contents
iterate_shared: called when the VFS needs to read the directory contents
diff --git a/Documentation/filesystems/xfs.txt b/Documentation/filesystems/xfs.txt
index 9ccfd1bc6201..a5cbb5e0e3db 100644
--- a/Documentation/filesystems/xfs.txt
+++ b/Documentation/filesystems/xfs.txt
@@ -272,7 +272,7 @@ The following sysctls are available for the XFS filesystem:
XFS_ERRLEVEL_LOW: 1
XFS_ERRLEVEL_HIGH: 5
- fs.xfs.panic_mask (Min: 0 Default: 0 Max: 255)
+ fs.xfs.panic_mask (Min: 0 Default: 0 Max: 256)
Causes certain error conditions to call BUG(). Value is a bitmask;
OR together the tags which represent errors which should cause panics:
@@ -285,6 +285,7 @@ The following sysctls are available for the XFS filesystem:
XFS_PTAG_SHUTDOWN_IOERROR 0x00000020
XFS_PTAG_SHUTDOWN_LOGERROR 0x00000040
XFS_PTAG_FSBLOCK_ZERO 0x00000080
+ XFS_PTAG_VERIFIER_ERROR 0x00000100
This option is intended for debugging only.