diff options
author | Linus Torvalds <torvalds@linux-foundation.org> | 2024-01-10 17:45:22 +0100 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2024-01-10 17:45:22 +0100 |
commit | 12958e9c4c8e93ef694c10960c78453edf21526e (patch) | |
tree | 80e4a3a945358d9a546c3f5cda25bc5ceadcdf7c /Documentation/filesystems/xfs | |
parent | Merge tag 'fsnotify_for_v6.8-rc1' of git://git.kernel.org/pub/scm/linux/kerne... (diff) | |
parent | xfs: use the op name in trace_xlog_intent_recovery_failed (diff) | |
download | linux-12958e9c4c8e93ef694c10960c78453edf21526e.tar.xz linux-12958e9c4c8e93ef694c10960c78453edf21526e.zip |
Merge tag 'xfs-6.8-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs updates from Chandan Babu:
"New features/functionality:
- Online repair:
- Reserve disk space for online repairs
- Fix misinteraction between the AIL and btree bulkloader because
of which the bulk load fails to queue a buffer for writeback if
it happens to be on the AIL list
- Prevent transaction reservation overflows when reaping blocks
during online repair
- Whenever possible, bulkloader now copies multiple records into
a block
- Support repairing of
1. Per-AG free space, inode and refcount btrees
2. Ondisk inodes
3. File data and attribute fork mappings
- Verify the contents of
1. Inode and data fork of realtime bitmap file
2. Quota files
- Introduce MF_MEM_PRE_REMOVE. This will be used to notify tasks
about a pmem device being removed
Bug fixes:
- Fix memory leak of recovered attri intent items
- Fix UAF during log intent recovery
- Fix realtime geometry integer overflows
- Prevent scrub from live locking in xchk_iget
- Prevent fs shutdown when removing files during low free disk space
- Prevent transaction reservation overflow when extending an RT
device
- Prevent incorrect warning from being printed when extending a
filesystem
- Fix an off-by-one error in xreap_agextent_binval
- Serialize access to perag radix tree during deletion operation
- Fix perag memory leak during growfs
- Allow allocation of minlen realtime extent when the maximum sized
realtime free extent is minlen in size
Cleanups:
- Remove duplicate boilerplate code spread across functionality
associated with different log items
- Cleanup resblks interfaces
- Pass defer ops pointer to defer helpers instead of an enum
- Initialize di_crc in xfs_log_dinode to prevent KMSAN warnings
- Use static_assert() instead of BUILD_BUG_ON_MSG() to validate size
of structures and structure member offsets. This is done in order
to be able to share the code with userspace
- Move XFS documentation under a new directory specific to XFS
- Do not invoke deferred ops' ->create_done callback if the deferred
operation does not have an intent item associated with it
- Remove duplicate inclusion of header files from scrub/health.c
- Refactor Realtime code
- Cleanup attr code"
* tag 'xfs-6.8-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (123 commits)
xfs: use the op name in trace_xlog_intent_recovery_failed
xfs: fix a use after free in xfs_defer_finish_recovery
xfs: turn the XFS_DA_OP_REPLACE checks in xfs_attr_shortform_addname into asserts
xfs: remove xfs_attr_sf_hdr_t
xfs: remove struct xfs_attr_shortform
xfs: use xfs_attr_sf_findname in xfs_attr_shortform_getvalue
xfs: remove xfs_attr_shortform_lookup
xfs: simplify xfs_attr_sf_findname
xfs: move the xfs_attr_sf_lookup tracepoint
xfs: return if_data from xfs_idata_realloc
xfs: make if_data a void pointer
xfs: fold xfs_rtallocate_extent into xfs_bmap_rtalloc
xfs: simplify and optimize the RT allocation fallback cascade
xfs: reorder the minlen and prod calculations in xfs_bmap_rtalloc
xfs: remove XFS_RTMIN/XFS_RTMAX
xfs: remove rt-wrappers from xfs_format.h
xfs: factor out a xfs_rtalloc_sumlevel helper
xfs: tidy up xfs_rtallocate_extent_exact
xfs: merge the calls to xfs_rtallocate_range in xfs_rtallocate_block
xfs: reflow the tail end of xfs_rtallocate_extent_block
...
Diffstat (limited to 'Documentation/filesystems/xfs')
-rw-r--r-- | Documentation/filesystems/xfs/index.rst | 14 | ||||
-rw-r--r-- | Documentation/filesystems/xfs/xfs-delayed-logging-design.rst | 1087 | ||||
-rw-r--r-- | Documentation/filesystems/xfs/xfs-maintainer-entry-profile.rst | 194 | ||||
-rw-r--r-- | Documentation/filesystems/xfs/xfs-online-fsck-design.rst | 5315 | ||||
-rw-r--r-- | Documentation/filesystems/xfs/xfs-self-describing-metadata.rst | 353 |
5 files changed, 6963 insertions, 0 deletions
diff --git a/Documentation/filesystems/xfs/index.rst b/Documentation/filesystems/xfs/index.rst new file mode 100644 index 000000000000..ab66c57a5d18 --- /dev/null +++ b/Documentation/filesystems/xfs/index.rst @@ -0,0 +1,14 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================ +XFS Filesystem Documentation +============================ + +.. toctree:: + :maxdepth: 2 + :numbered: + + xfs-delayed-logging-design + xfs-maintainer-entry-profile + xfs-self-describing-metadata + xfs-online-fsck-design diff --git a/Documentation/filesystems/xfs/xfs-delayed-logging-design.rst b/Documentation/filesystems/xfs/xfs-delayed-logging-design.rst new file mode 100644 index 000000000000..6402ab8e370c --- /dev/null +++ b/Documentation/filesystems/xfs/xfs-delayed-logging-design.rst @@ -0,0 +1,1087 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================== +XFS Logging Design +================== + +Preamble +======== + +This document describes the design and algorithms that the XFS journalling +subsystem is based on. This document describes the design and algorithms that +the XFS journalling subsystem is based on so that readers may familiarize +themselves with the general concepts of how transaction processing in XFS works. + +We begin with an overview of transactions in XFS, followed by describing how +transaction reservations are structured and accounted, and then move into how we +guarantee forwards progress for long running transactions with finite initial +reservations bounds. At this point we need to explain how relogging works. With +the basic concepts covered, the design of the delayed logging mechanism is +documented. + + +Introduction +============ + +XFS uses Write Ahead Logging for ensuring changes to the filesystem metadata +are atomic and recoverable. For reasons of space and time efficiency, the +logging mechanisms are varied and complex, combining intents, logical and +physical logging mechanisms to provide the necessary recovery guarantees the +filesystem requires. + +Some objects, such as inodes and dquots, are logged in logical format where the +details logged are made up of the changes to in-core structures rather than +on-disk structures. Other objects - typically buffers - have their physical +changes logged. Long running atomic modifications have individual changes +chained together by intents, ensuring that journal recovery can restart and +finish an operation that was only partially done when the system stopped +functioning. + +The reason for these differences is to keep the amount of log space and CPU time +required to process objects being modified as small as possible and hence the +logging overhead as low as possible. Some items are very frequently modified, +and some parts of objects are more frequently modified than others, so keeping +the overhead of metadata logging low is of prime importance. + +The method used to log an item or chain modifications together isn't +particularly important in the scope of this document. It suffices to know that +the method used for logging a particular object or chaining modifications +together are different and are dependent on the object and/or modification being +performed. The logging subsystem only cares that certain specific rules are +followed to guarantee forwards progress and prevent deadlocks. + + +Transactions in XFS +=================== + +XFS has two types of high level transactions, defined by the type of log space +reservation they take. These are known as "one shot" and "permanent" +transactions. Permanent transaction reservations can take reservations that span +commit boundaries, whilst "one shot" transactions are for a single atomic +modification. + +The type and size of reservation must be matched to the modification taking +place. This means that permanent transactions can be used for one-shot +modifications, but one-shot reservations cannot be used for permanent +transactions. + +In the code, a one-shot transaction pattern looks somewhat like this:: + + tp = xfs_trans_alloc(<reservation>) + <lock items> + <join item to transaction> + <do modification> + xfs_trans_commit(tp); + +As items are modified in the transaction, the dirty regions in those items are +tracked via the transaction handle. Once the transaction is committed, all +resources joined to it are released, along with the remaining unused reservation +space that was taken at the transaction allocation time. + +In contrast, a permanent transaction is made up of multiple linked individual +transactions, and the pattern looks like this:: + + tp = xfs_trans_alloc(<reservation>) + xfs_ilock(ip, XFS_ILOCK_EXCL) + + loop { + xfs_trans_ijoin(tp, 0); + <do modification> + xfs_trans_log_inode(tp, ip); + xfs_trans_roll(&tp); + } + + xfs_trans_commit(tp); + xfs_iunlock(ip, XFS_ILOCK_EXCL); + +While this might look similar to a one-shot transaction, there is an important +difference: xfs_trans_roll() performs a specific operation that links two +transactions together:: + + ntp = xfs_trans_dup(tp); + xfs_trans_commit(tp); + xfs_trans_reserve(ntp); + +This results in a series of "rolling transactions" where the inode is locked +across the entire chain of transactions. Hence while this series of rolling +transactions is running, nothing else can read from or write to the inode and +this provides a mechanism for complex changes to appear atomic from an external +observer's point of view. + +It is important to note that a series of rolling transactions in a permanent +transaction does not form an atomic change in the journal. While each +individual modification is atomic, the chain is *not atomic*. If we crash half +way through, then recovery will only replay up to the last transactional +modification the loop made that was committed to the journal. + +This affects long running permanent transactions in that it is not possible to +predict how much of a long running operation will actually be recovered because +there is no guarantee of how much of the operation reached stale storage. Hence +if a long running operation requires multiple transactions to fully complete, +the high level operation must use intents and deferred operations to guarantee +recovery can complete the operation once the first transactions is persisted in +the on-disk journal. + + +Transactions are Asynchronous +============================= + +In XFS, all high level transactions are asynchronous by default. This means that +xfs_trans_commit() does not guarantee that the modification has been committed +to stable storage when it returns. Hence when a system crashes, not all the +completed transactions will be replayed during recovery. + +However, the logging subsystem does provide global ordering guarantees, such +that if a specific change is seen after recovery, all metadata modifications +that were committed prior to that change will also be seen. + +For single shot operations that need to reach stable storage immediately, or +ensuring that a long running permanent transaction is fully committed once it is +complete, we can explicitly tag a transaction as synchronous. This will trigger +a "log force" to flush the outstanding committed transactions to stable storage +in the journal and wait for that to complete. + +Synchronous transactions are rarely used, however, because they limit logging +throughput to the IO latency limitations of the underlying storage. Instead, we +tend to use log forces to ensure modifications are on stable storage only when +a user operation requires a synchronisation point to occur (e.g. fsync). + + +Transaction Reservations +======================== + +It has been mentioned a number of times now that the logging subsystem needs to +provide a forwards progress guarantee so that no modification ever stalls +because it can't be written to the journal due to a lack of space in the +journal. This is achieved by the transaction reservations that are made when +a transaction is first allocated. For permanent transactions, these reservations +are maintained as part of the transaction rolling mechanism. + +A transaction reservation provides a guarantee that there is physical log space +available to write the modification into the journal before we start making +modifications to objects and items. As such, the reservation needs to be large +enough to take into account the amount of metadata that the change might need to +log in the worst case. This means that if we are modifying a btree in the +transaction, we have to reserve enough space to record a full leaf-to-root split +of the btree. As such, the reservations are quite complex because we have to +take into account all the hidden changes that might occur. + +For example, a user data extent allocation involves allocating an extent from +free space, which modifies the free space trees. That's two btrees. Inserting +the extent into the inode's extent map might require a split of the extent map +btree, which requires another allocation that can modify the free space trees +again. Then we might have to update reverse mappings, which modifies yet +another btree which might require more space. And so on. Hence the amount of +metadata that a "simple" operation can modify can be quite large. + +This "worst case" calculation provides us with the static "unit reservation" +for the transaction that is calculated at mount time. We must guarantee that the +log has this much space available before the transaction is allowed to proceed +so that when we come to write the dirty metadata into the log we don't run out +of log space half way through the write. + +For one-shot transactions, a single unit space reservation is all that is +required for the transaction to proceed. For permanent transactions, however, we +also have a "log count" that affects the size of the reservation that is to be +made. + +While a permanent transaction can get by with a single unit of space +reservation, it is somewhat inefficient to do this as it requires the +transaction rolling mechanism to re-reserve space on every transaction roll. We +know from the implementation of the permanent transactions how many transaction +rolls are likely for the common modifications that need to be made. + +For example, an inode allocation is typically two transactions - one to +physically allocate a free inode chunk on disk, and another to allocate an inode +from an inode chunk that has free inodes in it. Hence for an inode allocation +transaction, we might set the reservation log count to a value of 2 to indicate +that the common/fast path transaction will commit two linked transactions in a +chain. Each time a permanent transaction rolls, it consumes an entire unit +reservation. + +Hence when the permanent transaction is first allocated, the log space +reservation is increased from a single unit reservation to multiple unit +reservations. That multiple is defined by the reservation log count, and this +means we can roll the transaction multiple times before we have to re-reserve +log space when we roll the transaction. This ensures that the common +modifications we make only need to reserve log space once. + +If the log count for a permanent transaction reaches zero, then it needs to +re-reserve physical space in the log. This is somewhat complex, and requires +an understanding of how the log accounts for space that has been reserved. + + +Log Space Accounting +==================== + +The position in the log is typically referred to as a Log Sequence Number (LSN). +The log is circular, so the positions in the log are defined by the combination +of a cycle number - the number of times the log has been overwritten - and the +offset into the log. A LSN carries the cycle in the upper 32 bits and the +offset in the lower 32 bits. The offset is in units of "basic blocks" (512 +bytes). Hence we can do realtively simple LSN based math to keep track of +available space in the log. + +Log space accounting is done via a pair of constructs called "grant heads". The +position of the grant heads is an absolute value, so the amount of space +available in the log is defined by the distance between the position of the +grant head and the current log tail. That is, how much space can be +reserved/consumed before the grant heads would fully wrap the log and overtake +the tail position. + +The first grant head is the "reserve" head. This tracks the byte count of the +reservations currently held by active transactions. It is a purely in-memory +accounting of the space reservation and, as such, actually tracks byte offsets +into the log rather than basic blocks. Hence it technically isn't using LSNs to +represent the log position, but it is still treated like a split {cycle,offset} +tuple for the purposes of tracking reservation space. + +The reserve grant head is used to accurately account for exact transaction +reservations amounts and the exact byte count that modifications actually make +and need to write into the log. The reserve head is used to prevent new +transactions from taking new reservations when the head reaches the current +tail. It will block new reservations in a FIFO queue and as the log tail moves +forward it will wake them in order once sufficient space is available. This FIFO +mechanism ensures no transaction is starved of resources when log space +shortages occur. + +The other grant head is the "write" head. Unlike the reserve head, this grant +head contains an LSN and it tracks the physical space usage in the log. While +this might sound like it is accounting the same state as the reserve grant head +- and it mostly does track exactly the same location as the reserve grant head - +there are critical differences in behaviour between them that provides the +forwards progress guarantees that rolling permanent transactions require. + +These differences when a permanent transaction is rolled and the internal "log +count" reaches zero and the initial set of unit reservations have been +exhausted. At this point, we still require a log space reservation to continue +the next transaction in the sequeunce, but we have none remaining. We cannot +sleep during the transaction commit process waiting for new log space to become +available, as we may end up on the end of the FIFO queue and the items we have +locked while we sleep could end up pinning the tail of the log before there is +enough free space in the log to fulfill all of the pending reservations and +then wake up transaction commit in progress. + +To take a new reservation without sleeping requires us to be able to take a +reservation even if there is no reservation space currently available. That is, +we need to be able to *overcommit* the log reservation space. As has already +been detailed, we cannot overcommit physical log space. However, the reserve +grant head does not track physical space - it only accounts for the amount of +reservations we currently have outstanding. Hence if the reserve head passes +over the tail of the log all it means is that new reservations will be throttled +immediately and remain throttled until the log tail is moved forward far enough +to remove the overcommit and start taking new reservations. In other words, we +can overcommit the reserve head without violating the physical log head and tail +rules. + +As a result, permanent transactions only "regrant" reservation space during +xfs_trans_commit() calls, while the physical log space reservation - tracked by +the write head - is then reserved separately by a call to xfs_log_reserve() +after the commit completes. Once the commit completes, we can sleep waiting for +physical log space to be reserved from the write grant head, but only if one +critical rule has been observed:: + + Code using permanent reservations must always log the items they hold + locked across each transaction they roll in the chain. + +"Re-logging" the locked items on every transaction roll ensures that the items +attached to the transaction chain being rolled are always relocated to the +physical head of the log and so do not pin the tail of the log. If a locked item +pins the tail of the log when we sleep on the write reservation, then we will +deadlock the log as we cannot take the locks needed to write back that item and +move the tail of the log forwards to free up write grant space. Re-logging the +locked items avoids this deadlock and guarantees that the log reservation we are +making cannot self-deadlock. + +If all rolling transactions obey this rule, then they can all make forwards +progress independently because nothing will block the progress of the log +tail moving forwards and hence ensuring that write grant space is always +(eventually) made available to permanent transactions no matter how many times +they roll. + + +Re-logging Explained +==================== + +XFS allows multiple separate modifications to a single object to be carried in +the log at any given time. This allows the log to avoid needing to flush each +change to disk before recording a new change to the object. XFS does this via a +method called "re-logging". Conceptually, this is quite simple - all it requires +is that any new change to the object is recorded with a *new copy* of all the +existing changes in the new transaction that is written to the log. + +That is, if we have a sequence of changes A through to F, and the object was +written to disk after change D, we would see in the log the following series +of transactions, their contents and the log sequence number (LSN) of the +transaction:: + + Transaction Contents LSN + A A X + B A+B X+n + C A+B+C X+n+m + D A+B+C+D X+n+m+o + <object written to disk> + E E Y (> X+n+m+o) + F E+F Y+p + +In other words, each time an object is relogged, the new transaction contains +the aggregation of all the previous changes currently held only in the log. + +This relogging technique allows objects to be moved forward in the log so that +an object being relogged does not prevent the tail of the log from ever moving +forward. This can be seen in the table above by the changing (increasing) LSN +of each subsequent transaction, and it's the technique that allows us to +implement long-running, multiple-commit permanent transactions. + +A typical example of a rolling transaction is the removal of extents from an +inode which can only be done at a rate of two extents per transaction because +of reservation size limitations. Hence a rolling extent removal transaction +keeps relogging the inode and btree buffers as they get modified in each +removal operation. This keeps them moving forward in the log as the operation +progresses, ensuring that current operation never gets blocked by itself if the +log wraps around. + +Hence it can be seen that the relogging operation is fundamental to the correct +working of the XFS journalling subsystem. From the above description, most +people should be able to see why the XFS metadata operations writes so much to +the log - repeated operations to the same objects write the same changes to +the log over and over again. Worse is the fact that objects tend to get +dirtier as they get relogged, so each subsequent transaction is writing more +metadata into the log. + +It should now also be obvious how relogging and asynchronous transactions go +hand in hand. That is, transactions don't get written to the physical journal +until either a log buffer is filled (a log buffer can hold multiple +transactions) or a synchronous operation forces the log buffers holding the +transactions to disk. This means that XFS is doing aggregation of transactions +in memory - batching them, if you like - to minimise the impact of the log IO on +transaction throughput. + +The limitation on asynchronous transaction throughput is the number and size of +log buffers made available by the log manager. By default there are 8 log +buffers available and the size of each is 32kB - the size can be increased up +to 256kB by use of a mount option. + +Effectively, this gives us the maximum bound of outstanding metadata changes +that can be made to the filesystem at any point in time - if all the log +buffers are full and under IO, then no more transactions can be committed until +the current batch completes. It is now common for a single current CPU core to +be to able to issue enough transactions to keep the log buffers full and under +IO permanently. Hence the XFS journalling subsystem can be considered to be IO +bound. + +Delayed Logging: Concepts +========================= + +The key thing to note about the asynchronous logging combined with the +relogging technique XFS uses is that we can be relogging changed objects +multiple times before they are committed to disk in the log buffers. If we +return to the previous relogging example, it is entirely possible that +transactions A through D are committed to disk in the same log buffer. + +That is, a single log buffer may contain multiple copies of the same object, +but only one of those copies needs to be there - the last one "D", as it +contains all the changes from the previous changes. In other words, we have one +necessary copy in the log buffer, and three stale copies that are simply +wasting space. When we are doing repeated operations on the same set of +objects, these "stale objects" can be over 90% of the space used in the log +buffers. It is clear that reducing the number of stale objects written to the +log would greatly reduce the amount of metadata we write to the log, and this +is the fundamental goal of delayed logging. + +From a conceptual point of view, XFS is already doing relogging in memory (where +memory == log buffer), only it is doing it extremely inefficiently. It is using +logical to physical formatting to do the relogging because there is no +infrastructure to keep track of logical changes in memory prior to physically +formatting the changes in a transaction to the log buffer. Hence we cannot avoid +accumulating stale objects in the log buffers. + +Delayed logging is the name we've given to keeping and tracking transactional +changes to objects in memory outside the log buffer infrastructure. Because of +the relogging concept fundamental to the XFS journalling subsystem, this is +actually relatively easy to do - all the changes to logged items are already +tracked in the current infrastructure. The big problem is how to accumulate +them and get them to the log in a consistent, recoverable manner. +Describing the problems and how they have been solved is the focus of this +document. + +One of the key changes that delayed logging makes to the operation of the +journalling subsystem is that it disassociates the amount of outstanding +metadata changes from the size and number of log buffers available. In other +words, instead of there only being a maximum of 2MB of transaction changes not +written to the log at any point in time, there may be a much greater amount +being accumulated in memory. Hence the potential for loss of metadata on a +crash is much greater than for the existing logging mechanism. + +It should be noted that this does not change the guarantee that log recovery +will result in a consistent filesystem. What it does mean is that as far as the +recovered filesystem is concerned, there may be many thousands of transactions +that simply did not occur as a result of the crash. This makes it even more +important that applications that care about their data use fsync() where they +need to ensure application level data integrity is maintained. + +It should be noted that delayed logging is not an innovative new concept that +warrants rigorous proofs to determine whether it is correct or not. The method +of accumulating changes in memory for some period before writing them to the +log is used effectively in many filesystems including ext3 and ext4. Hence +no time is spent in this document trying to convince the reader that the +concept is sound. Instead it is simply considered a "solved problem" and as +such implementing it in XFS is purely an exercise in software engineering. + +The fundamental requirements for delayed logging in XFS are simple: + + 1. Reduce the amount of metadata written to the log by at least + an order of magnitude. + 2. Supply sufficient statistics to validate Requirement #1. + 3. Supply sufficient new tracing infrastructure to be able to debug + problems with the new code. + 4. No on-disk format change (metadata or log format). + 5. Enable and disable with a mount option. + 6. No performance regressions for synchronous transaction workloads. + +Delayed Logging: Design +======================= + +Storing Changes +--------------- + +The problem with accumulating changes at a logical level (i.e. just using the +existing log item dirty region tracking) is that when it comes to writing the +changes to the log buffers, we need to ensure that the object we are formatting +is not changing while we do this. This requires locking the object to prevent +concurrent modification. Hence flushing the logical changes to the log would +require us to lock every object, format them, and then unlock them again. + +This introduces lots of scope for deadlocks with transactions that are already +running. For example, a transaction has object A locked and modified, but needs +the delayed logging tracking lock to commit the transaction. However, the +flushing thread has the delayed logging tracking lock already held, and is +trying to get the lock on object A to flush it to the log buffer. This appears +to be an unsolvable deadlock condition, and it was solving this problem that +was the barrier to implementing delayed logging for so long. + +The solution is relatively simple - it just took a long time to recognise it. +Put simply, the current logging code formats the changes to each item into an +vector array that points to the changed regions in the item. The log write code +simply copies the memory these vectors point to into the log buffer during +transaction commit while the item is locked in the transaction. Instead of +using the log buffer as the destination of the formatting code, we can use an +allocated memory buffer big enough to fit the formatted vector. + +If we then copy the vector into the memory buffer and rewrite the vector to +point to the memory buffer rather than the object itself, we now have a copy of +the changes in a format that is compatible with the log buffer writing code. +that does not require us to lock the item to access. This formatting and +rewriting can all be done while the object is locked during transaction commit, +resulting in a vector that is transactionally consistent and can be accessed +without needing to lock the owning item. + +Hence we avoid the need to lock items when we need to flush outstanding +asynchronous transactions to the log. The differences between the existing +formatting method and the delayed logging formatting can be seen in the +diagram below. + +Current format log vector:: + + Object +---------------------------------------------+ + Vector 1 +----+ + Vector 2 +----+ + Vector 3 +----------+ + +After formatting:: + + Log Buffer +-V1-+-V2-+----V3----+ + +Delayed logging vector:: + + Object +---------------------------------------------+ + Vector 1 +----+ + Vector 2 +----+ + Vector 3 +----------+ + +After formatting:: + + Memory Buffer +-V1-+-V2-+----V3----+ + Vector 1 +----+ + Vector 2 +----+ + Vector 3 +----------+ + +The memory buffer and associated vector need to be passed as a single object, +but still need to be associated with the parent object so if the object is +relogged we can replace the current memory buffer with a new memory buffer that +contains the latest changes. + +The reason for keeping the vector around after we've formatted the memory +buffer is to support splitting vectors across log buffer boundaries correctly. +If we don't keep the vector around, we do not know where the region boundaries +are in the item, so we'd need a new encapsulation method for regions in the log +buffer writing (i.e. double encapsulation). This would be an on-disk format +change and as such is not desirable. It also means we'd have to write the log +region headers in the formatting stage, which is problematic as there is per +region state that needs to be placed into the headers during the log write. + +Hence we need to keep the vector, but by attaching the memory buffer to it and +rewriting the vector addresses to point at the memory buffer we end up with a +self-describing object that can be passed to the log buffer write code to be +handled in exactly the same manner as the existing log vectors are handled. +Hence we avoid needing a new on-disk format to handle items that have been +relogged in memory. + + +Tracking Changes +---------------- + +Now that we can record transactional changes in memory in a form that allows +them to be used without limitations, we need to be able to track and accumulate +them so that they can be written to the log at some later point in time. The +log item is the natural place to store this vector and buffer, and also makes sense +to be the object that is used to track committed objects as it will always +exist once the object has been included in a transaction. + +The log item is already used to track the log items that have been written to +the log but not yet written to disk. Such log items are considered "active" +and as such are stored in the Active Item List (AIL) which is a LSN-ordered +double linked list. Items are inserted into this list during log buffer IO +completion, after which they are unpinned and can be written to disk. An object +that is in the AIL can be relogged, which causes the object to be pinned again +and then moved forward in the AIL when the log buffer IO completes for that +transaction. + +Essentially, this shows that an item that is in the AIL can still be modified +and relogged, so any tracking must be separate to the AIL infrastructure. As +such, we cannot reuse the AIL list pointers for tracking committed items, nor +can we store state in any field that is protected by the AIL lock. Hence the +committed item tracking needs its own locks, lists and state fields in the log +item. + +Similar to the AIL, tracking of committed items is done through a new list +called the Committed Item List (CIL). The list tracks log items that have been +committed and have formatted memory buffers attached to them. It tracks objects +in transaction commit order, so when an object is relogged it is removed from +its place in the list and re-inserted at the tail. This is entirely arbitrary +and done to make it easy for debugging - the last items in the list are the +ones that are most recently modified. Ordering of the CIL is not necessary for +transactional integrity (as discussed in the next section) so the ordering is +done for convenience/sanity of the developers. + + +Delayed Logging: Checkpoints +---------------------------- + +When we have a log synchronisation event, commonly known as a "log force", +all the items in the CIL must be written into the log via the log buffers. +We need to write these items in the order that they exist in the CIL, and they +need to be written as an atomic transaction. The need for all the objects to be +written as an atomic transaction comes from the requirements of relogging and +log replay - all the changes in all the objects in a given transaction must +either be completely replayed during log recovery, or not replayed at all. If +a transaction is not replayed because it is not complete in the log, then +no later transactions should be replayed, either. + +To fulfill this requirement, we need to write the entire CIL in a single log +transaction. Fortunately, the XFS log code has no fixed limit on the size of a +transaction, nor does the log replay code. The only fundamental limit is that +the transaction cannot be larger than just under half the size of the log. The +reason for this limit is that to find the head and tail of the log, there must +be at least one complete transaction in the log at any given time. If a +transaction is larger than half the log, then there is the possibility that a +crash during the write of a such a transaction could partially overwrite the +only complete previous transaction in the log. This will result in a recovery +failure and an inconsistent filesystem and hence we must enforce the maximum +size of a checkpoint to be slightly less than a half the log. + +Apart from this size requirement, a checkpoint transaction looks no different +to any other transaction - it contains a transaction header, a series of +formatted log items and a commit record at the tail. From a recovery +perspective, the checkpoint transaction is also no different - just a lot +bigger with a lot more items in it. The worst case effect of this is that we +might need to tune the recovery transaction object hash size. + +Because the checkpoint is just another transaction and all the changes to log +items are stored as log vectors, we can use the existing log buffer writing +code to write the changes into the log. To do this efficiently, we need to +minimise the time we hold the CIL locked while writing the checkpoint +transaction. The current log write code enables us to do this easily with the +way it separates the writing of the transaction contents (the log vectors) from +the transaction commit record, but tracking this requires us to have a +per-checkpoint context that travels through the log write process through to +checkpoint completion. + +Hence a checkpoint has a context that tracks the state of the current +checkpoint from initiation to checkpoint completion. A new context is initiated +at the same time a checkpoint transaction is started. That is, when we remove +all the current items from the CIL during a checkpoint operation, we move all +those changes into the current checkpoint context. We then initialise a new +context and attach that to the CIL for aggregation of new transactions. + +This allows us to unlock the CIL immediately after transfer of all the +committed items and effectively allows new transactions to be issued while we +are formatting the checkpoint into the log. It also allows concurrent +checkpoints to be written into the log buffers in the case of log force heavy +workloads, just like the existing transaction commit code does. This, however, +requires that we strictly order the commit records in the log so that +checkpoint sequence order is maintained during log replay. + +To ensure that we can be writing an item into a checkpoint transaction at +the same time another transaction modifies the item and inserts the log item +into the new CIL, then checkpoint transaction commit code cannot use log items +to store the list of log vectors that need to be written into the transaction. +Hence log vectors need to be able to be chained together to allow them to be +detached from the log items. That is, when the CIL is flushed the memory +buffer and log vector attached to each log item needs to be attached to the +checkpoint context so that the log item can be released. In diagrammatic form, +the CIL would look like this before the flush:: + + CIL Head + | + V + Log Item <-> log vector 1 -> memory buffer + | -> vector array + V + Log Item <-> log vector 2 -> memory buffer + | -> vector array + V + ...... + | + V + Log Item <-> log vector N-1 -> memory buffer + | -> vector array + V + Log Item <-> log vector N -> memory buffer + -> vector array + +And after the flush the CIL head is empty, and the checkpoint context log +vector list would look like:: + + Checkpoint Context + | + V + log vector 1 -> memory buffer + | -> vector array + | -> Log Item + V + log vector 2 -> memory buffer + | -> vector array + | -> Log Item + V + ...... + | + V + log vector N-1 -> memory buffer + | -> vector array + | -> Log Item + V + log vector N -> memory buffer + -> vector array + -> Log Item + +Once this transfer is done, the CIL can be unlocked and new transactions can +start, while the checkpoint flush code works over the log vector chain to +commit the checkpoint. + +Once the checkpoint is written into the log buffers, the checkpoint context is +attached to the log buffer that the commit record was written to along with a +completion callback. Log IO completion will call that callback, which can then +run transaction committed processing for the log items (i.e. insert into AIL +and unpin) in the log vector chain and then free the log vector chain and +checkpoint context. + +Discussion Point: I am uncertain as to whether the log item is the most +efficient way to track vectors, even though it seems like the natural way to do +it. The fact that we walk the log items (in the CIL) just to chain the log +vectors and break the link between the log item and the log vector means that +we take a cache line hit for the log item list modification, then another for +the log vector chaining. If we track by the log vectors, then we only need to +break the link between the log item and the log vector, which means we should +dirty only the log item cachelines. Normally I wouldn't be concerned about one +vs two dirty cachelines except for the fact I've seen upwards of 80,000 log +vectors in one checkpoint transaction. I'd guess this is a "measure and +compare" situation that can be done after a working and reviewed implementation +is in the dev tree.... + +Delayed Logging: Checkpoint Sequencing +-------------------------------------- + +One of the key aspects of the XFS transaction subsystem is that it tags +committed transactions with the log sequence number of the transaction commit. +This allows transactions to be issued asynchronously even though there may be +future operations that cannot be completed until that transaction is fully +committed to the log. In the rare case that a dependent operation occurs (e.g. +re-using a freed metadata extent for a data extent), a special, optimised log +force can be issued to force the dependent transaction to disk immediately. + +To do this, transactions need to record the LSN of the commit record of the +transaction. This LSN comes directly from the log buffer the transaction is +written into. While this works just fine for the existing transaction +mechanism, it does not work for delayed logging because transactions are not +written directly into the log buffers. Hence some other method of sequencing +transactions is required. + +As discussed in the checkpoint section, delayed logging uses per-checkpoint +contexts, and as such it is simple to assign a sequence number to each +checkpoint. Because the switching of checkpoint contexts must be done +atomically, it is simple to ensure that each new context has a monotonically +increasing sequence number assigned to it without the need for an external +atomic counter - we can just take the current context sequence number and add +one to it for the new context. + +Then, instead of assigning a log buffer LSN to the transaction commit LSN +during the commit, we can assign the current checkpoint sequence. This allows +operations that track transactions that have not yet completed know what +checkpoint sequence needs to be committed before they can continue. As a +result, the code that forces the log to a specific LSN now needs to ensure that +the log forces to a specific checkpoint. + +To ensure that we can do this, we need to track all the checkpoint contexts +that are currently committing to the log. When we flush a checkpoint, the +context gets added to a "committing" list which can be searched. When a +checkpoint commit completes, it is removed from the committing list. Because +the checkpoint context records the LSN of the commit record for the checkpoint, +we can also wait on the log buffer that contains the commit record, thereby +using the existing log force mechanisms to execute synchronous forces. + +It should be noted that the synchronous forces may need to be extended with +mitigation algorithms similar to the current log buffer code to allow +aggregation of multiple synchronous transactions if there are already +synchronous transactions being flushed. Investigation of the performance of the +current design is needed before making any decisions here. + +The main concern with log forces is to ensure that all the previous checkpoints +are also committed to disk before the one we need to wait for. Therefore we +need to check that all the prior contexts in the committing list are also +complete before waiting on the one we need to complete. We do this +synchronisation in the log force code so that we don't need to wait anywhere +else for such serialisation - it only matters when we do a log force. + +The only remaining complexity is that a log force now also has to handle the +case where the forcing sequence number is the same as the current context. That +is, we need to flush the CIL and potentially wait for it to complete. This is a +simple addition to the existing log forcing code to check the sequence numbers +and push if required. Indeed, placing the current sequence checkpoint flush in +the log force code enables the current mechanism for issuing synchronous +transactions to remain untouched (i.e. commit an asynchronous transaction, then +force the log at the LSN of that transaction) and so the higher level code +behaves the same regardless of whether delayed logging is being used or not. + +Delayed Logging: Checkpoint Log Space Accounting +------------------------------------------------ + +The big issue for a checkpoint transaction is the log space reservation for the +transaction. We don't know how big a checkpoint transaction is going to be +ahead of time, nor how many log buffers it will take to write out, nor the +number of split log vector regions are going to be used. We can track the +amount of log space required as we add items to the commit item list, but we +still need to reserve the space in the log for the checkpoint. + +A typical transaction reserves enough space in the log for the worst case space +usage of the transaction. The reservation accounts for log record headers, +transaction and region headers, headers for split regions, buffer tail padding, +etc. as well as the actual space for all the changed metadata in the +transaction. While some of this is fixed overhead, much of it is dependent on +the size of the transaction and the number of regions being logged (the number +of log vectors in the transaction). + +An example of the differences would be logging directory changes versus logging +inode changes. If you modify lots of inode cores (e.g. ``chmod -R g+w *``), then +there are lots of transactions that only contain an inode core and an inode log +format structure. That is, two vectors totaling roughly 150 bytes. If we modify +10,000 inodes, we have about 1.5MB of metadata to write in 20,000 vectors. Each +vector is 12 bytes, so the total to be logged is approximately 1.75MB. In +comparison, if we are logging full directory buffers, they are typically 4KB +each, so we in 1.5MB of directory buffers we'd have roughly 400 buffers and a +buffer format structure for each buffer - roughly 800 vectors or 1.51MB total +space. From this, it should be obvious that a static log space reservation is +not particularly flexible and is difficult to select the "optimal value" for +all workloads. + +Further, if we are going to use a static reservation, which bit of the entire +reservation does it cover? We account for space used by the transaction +reservation by tracking the space currently used by the object in the CIL and +then calculating the increase or decrease in space used as the object is +relogged. This allows for a checkpoint reservation to only have to account for +log buffer metadata used such as log header records. + +However, even using a static reservation for just the log metadata is +problematic. Typically log record headers use at least 16KB of log space per +1MB of log space consumed (512 bytes per 32k) and the reservation needs to be +large enough to handle arbitrary sized checkpoint transactions. This +reservation needs to be made before the checkpoint is started, and we need to +be able to reserve the space without sleeping. For a 8MB checkpoint, we need a +reservation of around 150KB, which is a non-trivial amount of space. + +A static reservation needs to manipulate the log grant counters - we can take a +permanent reservation on the space, but we still need to make sure we refresh +the write reservation (the actual space available to the transaction) after +every checkpoint transaction completion. Unfortunately, if this space is not +available when required, then the regrant code will sleep waiting for it. + +The problem with this is that it can lead to deadlocks as we may need to commit +checkpoints to be able to free up log space (refer back to the description of +rolling transactions for an example of this). Hence we *must* always have +space available in the log if we are to use static reservations, and that is +very difficult and complex to arrange. It is possible to do, but there is a +simpler way. + +The simpler way of doing this is tracking the entire log space used by the +items in the CIL and using this to dynamically calculate the amount of log +space required by the log metadata. If this log metadata space changes as a +result of a transaction commit inserting a new memory buffer into the CIL, then +the difference in space required is removed from the transaction that causes +the change. Transactions at this level will *always* have enough space +available in their reservation for this as they have already reserved the +maximal amount of log metadata space they require, and such a delta reservation +will always be less than or equal to the maximal amount in the reservation. + +Hence we can grow the checkpoint transaction reservation dynamically as items +are added to the CIL and avoid the need for reserving and regranting log space +up front. This avoids deadlocks and removes a blocking point from the +checkpoint flush code. + +As mentioned early, transactions can't grow to more than half the size of the +log. Hence as part of the reservation growing, we need to also check the size +of the reservation against the maximum allowed transaction size. If we reach +the maximum threshold, we need to push the CIL to the log. This is effectively +a "background flush" and is done on demand. This is identical to +a CIL push triggered by a log force, only that there is no waiting for the +checkpoint commit to complete. This background push is checked and executed by +transaction commit code. + +If the transaction subsystem goes idle while we still have items in the CIL, +they will be flushed by the periodic log force issued by the xfssyncd. This log +force will push the CIL to disk, and if the transaction subsystem stays idle, +allow the idle log to be covered (effectively marked clean) in exactly the same +manner that is done for the existing logging method. A discussion point is +whether this log force needs to be done more frequently than the current rate +which is once every 30s. + + +Delayed Logging: Log Item Pinning +--------------------------------- + +Currently log items are pinned during transaction commit while the items are +still locked. This happens just after the items are formatted, though it could +be done any time before the items are unlocked. The result of this mechanism is +that items get pinned once for every transaction that is committed to the log +buffers. Hence items that are relogged in the log buffers will have a pin count +for every outstanding transaction they were dirtied in. When each of these +transactions is completed, they will unpin the item once. As a result, the item +only becomes unpinned when all the transactions complete and there are no +pending transactions. Thus the pinning and unpinning of a log item is symmetric +as there is a 1:1 relationship with transaction commit and log item completion. + +For delayed logging, however, we have an asymmetric transaction commit to +completion relationship. Every time an object is relogged in the CIL it goes +through the commit process without a corresponding completion being registered. +That is, we now have a many-to-one relationship between transaction commit and +log item completion. The result of this is that pinning and unpinning of the +log items becomes unbalanced if we retain the "pin on transaction commit, unpin +on transaction completion" model. + +To keep pin/unpin symmetry, the algorithm needs to change to a "pin on +insertion into the CIL, unpin on checkpoint completion". In other words, the +pinning and unpinning becomes symmetric around a checkpoint context. We have to +pin the object the first time it is inserted into the CIL - if it is already in +the CIL during a transaction commit, then we do not pin it again. Because there +can be multiple outstanding checkpoint contexts, we can still see elevated pin +counts, but as each checkpoint completes the pin count will retain the correct +value according to its context. + +Just to make matters slightly more complex, this checkpoint level context +for the pin count means that the pinning of an item must take place under the +CIL commit/flush lock. If we pin the object outside this lock, we cannot +guarantee which context the pin count is associated with. This is because of +the fact pinning the item is dependent on whether the item is present in the +current CIL or not. If we don't pin the CIL first before we check and pin the +object, we have a race with CIL being flushed between the check and the pin +(or not pinning, as the case may be). Hence we must hold the CIL flush/commit +lock to guarantee that we pin the items correctly. + +Delayed Logging: Concurrent Scalability +--------------------------------------- + +A fundamental requirement for the CIL is that accesses through transaction +commits must scale to many concurrent commits. The current transaction commit +code does not break down even when there are transactions coming from 2048 +processors at once. The current transaction code does not go any faster than if +there was only one CPU using it, but it does not slow down either. + +As a result, the delayed logging transaction commit code needs to be designed +for concurrency from the ground up. It is obvious that there are serialisation +points in the design - the three important ones are: + + 1. Locking out new transaction commits while flushing the CIL + 2. Adding items to the CIL and updating item space accounting + 3. Checkpoint commit ordering + +Looking at the transaction commit and CIL flushing interactions, it is clear +that we have a many-to-one interaction here. That is, the only restriction on +the number of concurrent transactions that can be trying to commit at once is +the amount of space available in the log for their reservations. The practical +limit here is in the order of several hundred concurrent transactions for a +128MB log, which means that it is generally one per CPU in a machine. + +The amount of time a transaction commit needs to hold out a flush is a +relatively long period of time - the pinning of log items needs to be done +while we are holding out a CIL flush, so at the moment that means it is held +across the formatting of the objects into memory buffers (i.e. while memcpy()s +are in progress). Ultimately a two pass algorithm where the formatting is done +separately to the pinning of objects could be used to reduce the hold time of +the transaction commit side. + +Because of the number of potential transaction commit side holders, the lock +really needs to be a sleeping lock - if the CIL flush takes the lock, we do not +want every other CPU in the machine spinning on the CIL lock. Given that +flushing the CIL could involve walking a list of tens of thousands of log +items, it will get held for a significant time and so spin contention is a +significant concern. Preventing lots of CPUs spinning doing nothing is the +main reason for choosing a sleeping lock even though nothing in either the +transaction commit or CIL flush side sleeps with the lock held. + +It should also be noted that CIL flushing is also a relatively rare operation +compared to transaction commit for asynchronous transaction workloads - only +time will tell if using a read-write semaphore for exclusion will limit +transaction commit concurrency due to cache line bouncing of the lock on the +read side. + +The second serialisation point is on the transaction commit side where items +are inserted into the CIL. Because transactions can enter this code +concurrently, the CIL needs to be protected separately from the above +commit/flush exclusion. It also needs to be an exclusive lock but it is only +held for a very short time and so a spin lock is appropriate here. It is +possible that this lock will become a contention point, but given the short +hold time once per transaction I think that contention is unlikely. + +The final serialisation point is the checkpoint commit record ordering code +that is run as part of the checkpoint commit and log force sequencing. The code +path that triggers a CIL flush (i.e. whatever triggers the log force) will enter +an ordering loop after writing all the log vectors into the log buffers but +before writing the commit record. This loop walks the list of committing +checkpoints and needs to block waiting for checkpoints to complete their commit +record write. As a result it needs a lock and a wait variable. Log force +sequencing also requires the same lock, list walk, and blocking mechanism to +ensure completion of checkpoints. + +These two sequencing operations can use the mechanism even though the +events they are waiting for are different. The checkpoint commit record +sequencing needs to wait until checkpoint contexts contain a commit LSN +(obtained through completion of a commit record write) while log force +sequencing needs to wait until previous checkpoint contexts are removed from +the committing list (i.e. they've completed). A simple wait variable and +broadcast wakeups (thundering herds) has been used to implement these two +serialisation queues. They use the same lock as the CIL, too. If we see too +much contention on the CIL lock, or too many context switches as a result of +the broadcast wakeups these operations can be put under a new spinlock and +given separate wait lists to reduce lock contention and the number of processes +woken by the wrong event. + + +Lifecycle Changes +----------------- + +The existing log item life cycle is as follows:: + + 1. Transaction allocate + 2. Transaction reserve + 3. Lock item + 4. Join item to transaction + If not already attached, + Allocate log item + Attach log item to owner item + Attach log item to transaction + 5. Modify item + Record modifications in log item + 6. Transaction commit + Pin item in memory + Format item into log buffer + Write commit LSN into transaction + Unlock item + Attach transaction to log buffer + + <log buffer IO dispatched> + <log buffer IO completes> + + 7. Transaction completion + Mark log item committed + Insert log item into AIL + Write commit LSN into log item + Unpin log item + 8. AIL traversal + Lock item + Mark log item clean + Flush item to disk + + <item IO completion> + + 9. Log item removed from AIL + Moves log tail + Item unlocked + +Essentially, steps 1-6 operate independently from step 7, which is also +independent of steps 8-9. An item can be locked in steps 1-6 or steps 8-9 +at the same time step 7 is occurring, but only steps 1-6 or 8-9 can occur +at the same time. If the log item is in the AIL or between steps 6 and 7 +and steps 1-6 are re-entered, then the item is relogged. Only when steps 8-9 +are entered and completed is the object considered clean. + +With delayed logging, there are new steps inserted into the life cycle:: + + 1. Transaction allocate + 2. Transaction reserve + 3. Lock item + 4. Join item to transaction + If not already attached, + Allocate log item + Attach log item to owner item + Attach log item to transaction + 5. Modify item + Record modifications in log item + 6. Transaction commit + Pin item in memory if not pinned in CIL + Format item into log vector + buffer + Attach log vector and buffer to log item + Insert log item into CIL + Write CIL context sequence into transaction + Unlock item + + <next log force> + + 7. CIL push + lock CIL flush + Chain log vectors and buffers together + Remove items from CIL + unlock CIL flush + write log vectors into log + sequence commit records + attach checkpoint context to log buffer + + <log buffer IO dispatched> + <log buffer IO completes> + + 8. Checkpoint completion + Mark log item committed + Insert item into AIL + Write commit LSN into log item + Unpin log item + 9. AIL traversal + Lock item + Mark log item clean + Flush item to disk + <item IO completion> + 10. Log item removed from AIL + Moves log tail + Item unlocked + +From this, it can be seen that the only life cycle differences between the two +logging methods are in the middle of the life cycle - they still have the same +beginning and end and execution constraints. The only differences are in the +committing of the log items to the log itself and the completion processing. +Hence delayed logging should not introduce any constraints on log item +behaviour, allocation or freeing that don't already exist. + +As a result of this zero-impact "insertion" of delayed logging infrastructure +and the design of the internal structures to avoid on disk format changes, we +can basically switch between delayed logging and the existing mechanism with a +mount option. Fundamentally, there is no reason why the log manager would not +be able to swap methods automatically and transparently depending on load +characteristics, but this should not be necessary if delayed logging works as +designed. diff --git a/Documentation/filesystems/xfs/xfs-maintainer-entry-profile.rst b/Documentation/filesystems/xfs/xfs-maintainer-entry-profile.rst new file mode 100644 index 000000000000..32b6ac4ca9d6 --- /dev/null +++ b/Documentation/filesystems/xfs/xfs-maintainer-entry-profile.rst @@ -0,0 +1,194 @@ +XFS Maintainer Entry Profile +============================ + +Overview +-------- +XFS is a well known high-performance filesystem in the Linux kernel. +The aim of this project is to provide and maintain a robust and +performant filesystem. + +Patches are generally merged to the for-next branch of the appropriate +git repository. +After a testing period, the for-next branch is merged to the master +branch. + +Kernel code are merged to the xfs-linux tree[0]. +Userspace code are merged to the xfsprogs tree[1]. +Test cases are merged to the xfstests tree[2]. +Ondisk format documentation are merged to the xfs-documentation tree[3]. + +All patchsets involving XFS *must* be cc'd in their entirety to the mailing +list linux-xfs@vger.kernel.org. + +Roles +----- +There are eight key roles in the XFS project. +A person can take on multiple roles, and a role can be filled by +multiple people. +Anyone taking on a role is advised to check in with themselves and +others on a regular basis about burnout. + +- **Outside Contributor**: Anyone who sends a patch but is not involved + in the XFS project on a regular basis. + These folks are usually people who work on other filesystems or + elsewhere in the kernel community. + +- **Developer**: Someone who is familiar with the XFS codebase enough to + write new code, documentation, and tests. + + Developers can often be found in the IRC channel mentioned by the ``C:`` + entry in the kernel MAINTAINERS file. + +- **Senior Developer**: A developer who is very familiar with at least + some part of the XFS codebase and/or other subsystems in the kernel. + These people collectively decide the long term goals of the project + and nudge the community in that direction. + They should help prioritize development and review work for each release + cycle. + + Senior developers tend to be more active participants in the IRC channel. + +- **Reviewer**: Someone (most likely also a developer) who reads code + submissions to decide: + + 0. Is the idea behind the contribution sound? + 1. Does the idea fit the goals of the project? + 2. Is the contribution designed correctly? + 3. Is the contribution polished? + 4. Can the contribution be tested effectively? + + Reviewers should identify themselves with an ``R:`` entry in the kernel + and fstests MAINTAINERS files. + +- **Testing Lead**: This person is responsible for setting the test + coverage goals of the project, negotiating with developers to decide + on new tests for new features, and making sure that developers and + release managers execute on the testing. + + The testing lead should identify themselves with an ``M:`` entry in + the XFS section of the fstests MAINTAINERS file. + +- **Bug Triager**: Someone who examines incoming bug reports in just + enough detail to identify the person to whom the report should be + forwarded. + + The bug triagers should identify themselves with a ``B:`` entry in + the kernel MAINTAINERS file. + +- **Release Manager**: This person merges reviewed patchsets into an + integration branch, tests the result locally, pushes the branch to a + public git repository, and sends pull requests further upstream. + The release manager is not expected to work on new feature patchsets. + If a developer and a reviewer fail to reach a resolution on some point, + the release manager must have the ability to intervene to try to drive a + resolution. + + The release manager should identify themselves with an ``M:`` entry in + the kernel MAINTAINERS file. + +- **Community Manager**: This person calls and moderates meetings of as many + XFS participants as they can get when mailing list discussions prove + insufficient for collective decisionmaking. + They may also serve as liaison between managers of the organizations + sponsoring work on any part of XFS. + +- **LTS Maintainer**: Someone who backports and tests bug fixes from + uptream to the LTS kernels. + There tend to be six separate LTS trees at any given time. + + The maintainer for a given LTS release should identify themselves with an + ``M:`` entry in the MAINTAINERS file for that LTS tree. + Unmaintained LTS kernels should be marked with status ``S: Orphan`` in that + same file. + +Submission Checklist Addendum +----------------------------- +Please follow these additional rules when submitting to XFS: + +- Patches affecting only the filesystem itself should be based against + the latest -rc or the for-next branch. + These patches will be merged back to the for-next branch. + +- Authors of patches touching other subsystems need to coordinate with + the maintainers of XFS and the relevant subsystems to decide how to + proceed with a merge. + +- Any patchset changing XFS should be cc'd in its entirety to linux-xfs. + Do not send partial patchsets; that makes analysis of the broader + context of the changes unnecessarily difficult. + +- Anyone making kernel changes that have corresponding changes to the + userspace utilities should send the userspace changes as separate + patchsets immediately after the kernel patchsets. + +- Authors of bug fix patches are expected to use fstests[2] to perform + an A/B test of the patch to determine that there are no regressions. + When possible, a new regression test case should be written for + fstests. + +- Authors of new feature patchsets must ensure that fstests will have + appropriate functional and input corner-case test cases for the new + feature. + +- When implementing a new feature, it is strongly suggested that the + developers write a design document to answer the following questions: + + * **What** problem is this trying to solve? + + * **Who** will benefit from this solution, and **where** will they + access it? + + * **How** will this new feature work? This should touch on major data + structures and algorithms supporting the solution at a higher level + than code comments. + + * **What** userspace interfaces are necessary to build off of the new + features? + + * **How** will this work be tested to ensure that it solves the + problems laid out in the design document without causing new + problems? + + The design document should be committed in the kernel documentation + directory. + It may be omitted if the feature is already well known to the + community. + +- Patchsets for the new tests should be submitted as separate patchsets + immediately after the kernel and userspace code patchsets. + +- Changes to the on-disk format of XFS must be described in the ondisk + format document[3] and submitted as a patchset after the fstests + patchsets. + +- Patchsets implementing bug fixes and further code cleanups should put + the bug fixes at the beginning of the series to ease backporting. + +Key Release Cycle Dates +----------------------- +Bug fixes may be sent at any time, though the release manager may decide to +defer a patch when the next merge window is close. + +Code submissions targeting the next merge window should be sent between +-rc1 and -rc6. +This gives the community time to review the changes, to suggest other changes, +and for the author to retest those changes. + +Code submissions also requiring changes to fs/iomap and targeting the +next merge window should be sent between -rc1 and -rc4. +This allows the broader kernel community adequate time to test the +infrastructure changes. + +Review Cadence +-------------- +In general, please wait at least one week before pinging for feedback. +To find reviewers, either consult the MAINTAINERS file, or ask +developers that have Reviewed-by tags for XFS changes to take a look and +offer their opinion. + +References +---------- +| [0] https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git/ +| [1] https://git.kernel.org/pub/scm/fs/xfs/xfsprogs-dev.git/ +| [2] https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/ +| [3] https://git.kernel.org/pub/scm/fs/xfs/xfs-documentation.git/ diff --git a/Documentation/filesystems/xfs/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst new file mode 100644 index 000000000000..352516feef6f --- /dev/null +++ b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst @@ -0,0 +1,5315 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. _xfs_online_fsck_design: + +.. + Mapping of heading styles within this document: + Heading 1 uses "====" above and below + Heading 2 uses "====" + Heading 3 uses "----" + Heading 4 uses "````" + Heading 5 uses "^^^^" + Heading 6 uses "~~~~" + Heading 7 uses "...." + + Sections are manually numbered because apparently that's what everyone + does in the kernel. + +====================== +XFS Online Fsck Design +====================== + +This document captures the design of the online filesystem check feature for +XFS. +The purpose of this document is threefold: + +- To help kernel distributors understand exactly what the XFS online fsck + feature is, and issues about which they should be aware. + +- To help people reading the code to familiarize themselves with the relevant + concepts and design points before they start digging into the code. + +- To help developers maintaining the system by capturing the reasons + supporting higher level decision making. + +As the online fsck code is merged, the links in this document to topic branches +will be replaced with links to code. + +This document is licensed under the terms of the GNU Public License, v2. +The primary author is Darrick J. Wong. + +This design document is split into seven parts. +Part 1 defines what fsck tools are and the motivations for writing a new one. +Parts 2 and 3 present a high level overview of how online fsck process works +and how it is tested to ensure correct functionality. +Part 4 discusses the user interface and the intended usage modes of the new +program. +Parts 5 and 6 show off the high level components and how they fit together, and +then present case studies of how each repair function actually works. +Part 7 sums up what has been discussed so far and speculates about what else +might be built atop online fsck. + +.. contents:: Table of Contents + :local: + +1. What is a Filesystem Check? +============================== + +A Unix filesystem has four main responsibilities: + +- Provide a hierarchy of names through which application programs can associate + arbitrary blobs of data for any length of time, + +- Virtualize physical storage media across those names, and + +- Retrieve the named data blobs at any time. + +- Examine resource usage. + +Metadata directly supporting these functions (e.g. files, directories, space +mappings) are sometimes called primary metadata. +Secondary metadata (e.g. reverse mapping and directory parent pointers) support +operations internal to the filesystem, such as internal consistency checking +and reorganization. +Summary metadata, as the name implies, condense information contained in +primary metadata for performance reasons. + +The filesystem check (fsck) tool examines all the metadata in a filesystem +to look for errors. +In addition to looking for obvious metadata corruptions, fsck also +cross-references different types of metadata records with each other to look +for inconsistencies. +People do not like losing data, so most fsck tools also contains some ability +to correct any problems found. +As a word of caution -- the primary goal of most Linux fsck tools is to restore +the filesystem metadata to a consistent state, not to maximize the data +recovered. +That precedent will not be challenged here. + +Filesystems of the 20th century generally lacked any redundancy in the ondisk +format, which means that fsck can only respond to errors by erasing files until +errors are no longer detected. +More recent filesystem designs contain enough redundancy in their metadata that +it is now possible to regenerate data structures when non-catastrophic errors +occur; this capability aids both strategies. + ++--------------------------------------------------------------------------+ +| **Note**: | ++--------------------------------------------------------------------------+ +| System administrators avoid data loss by increasing the number of | +| separate storage systems through the creation of backups; and they avoid | +| downtime by increasing the redundancy of each storage system through the | +| creation of RAID arrays. | +| fsck tools address only the first problem. | ++--------------------------------------------------------------------------+ + +TLDR; Show Me the Code! +----------------------- + +Code is posted to the kernel.org git trees as follows: +`kernel changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-symlink>`_, +`userspace changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_, and +`QA test changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-dirs>`_. +Each kernel patchset adding an online repair function will use the same branch +name across the kernel, xfsprogs, and fstests git repos. + +Existing Tools +-------------- + +The online fsck tool described here will be the third tool in the history of +XFS (on Linux) to check and repair filesystems. +Two programs precede it: + +The first program, ``xfs_check``, was created as part of the XFS debugger +(``xfs_db``) and can only be used with unmounted filesystems. +It walks all metadata in the filesystem looking for inconsistencies in the +metadata, though it lacks any ability to repair what it finds. +Due to its high memory requirements and inability to repair things, this +program is now deprecated and will not be discussed further. + +The second program, ``xfs_repair``, was created to be faster and more robust +than the first program. +Like its predecessor, it can only be used with unmounted filesystems. +It uses extent-based in-memory data structures to reduce memory consumption, +and tries to schedule readahead IO appropriately to reduce I/O waiting time +while it scans the metadata of the entire filesystem. +The most important feature of this tool is its ability to respond to +inconsistencies in file metadata and directory tree by erasing things as needed +to eliminate problems. +Space usage metadata are rebuilt from the observed file metadata. + +Problem Statement +----------------- + +The current XFS tools leave several problems unsolved: + +1. **User programs** suddenly **lose access** to the filesystem when unexpected + shutdowns occur as a result of silent corruptions in the metadata. + These occur **unpredictably** and often without warning. + +2. **Users** experience a **total loss of service** during the recovery period + after an **unexpected shutdown** occurs. + +3. **Users** experience a **total loss of service** if the filesystem is taken + offline to **look for problems** proactively. + +4. **Data owners** cannot **check the integrity** of their stored data without + reading all of it. + This may expose them to substantial billing costs when a linear media scan + performed by the storage system administrator might suffice. + +5. **System administrators** cannot **schedule** a maintenance window to deal + with corruptions if they **lack the means** to assess filesystem health + while the filesystem is online. + +6. **Fleet monitoring tools** cannot **automate periodic checks** of filesystem + health when doing so requires **manual intervention** and downtime. + +7. **Users** can be tricked into **doing things they do not desire** when + malicious actors **exploit quirks of Unicode** to place misleading names + in directories. + +Given this definition of the problems to be solved and the actors who would +benefit, the proposed solution is a third fsck tool that acts on a running +filesystem. + +This new third program has three components: an in-kernel facility to check +metadata, an in-kernel facility to repair metadata, and a userspace driver +program to drive fsck activity on a live filesystem. +``xfs_scrub`` is the name of the driver program. +The rest of this document presents the goals and use cases of the new fsck +tool, describes its major design points in connection to those goals, and +discusses the similarities and differences with existing tools. + ++--------------------------------------------------------------------------+ +| **Note**: | ++--------------------------------------------------------------------------+ +| Throughout this document, the existing offline fsck tool can also be | +| referred to by its current name "``xfs_repair``". | +| The userspace driver program for the new online fsck tool can be | +| referred to as "``xfs_scrub``". | +| The kernel portion of online fsck that validates metadata is called | +| "online scrub", and portion of the kernel that fixes metadata is called | +| "online repair". | ++--------------------------------------------------------------------------+ + +The naming hierarchy is broken up into objects known as directories and files +and the physical space is split into pieces known as allocation groups. +Sharding enables better performance on highly parallel systems and helps to +contain the damage when corruptions occur. +The division of the filesystem into principal objects (allocation groups and +inodes) means that there are ample opportunities to perform targeted checks and +repairs on a subset of the filesystem. + +While this is going on, other parts continue processing IO requests. +Even if a piece of filesystem metadata can only be regenerated by scanning the +entire system, the scan can still be done in the background while other file +operations continue. + +In summary, online fsck takes advantage of resource sharding and redundant +metadata to enable targeted checking and repair operations while the system +is running. +This capability will be coupled to automatic system management so that +autonomous self-healing of XFS maximizes service availability. + +2. Theory of Operation +====================== + +Because it is necessary for online fsck to lock and scan live metadata objects, +online fsck consists of three separate code components. +The first is the userspace driver program ``xfs_scrub``, which is responsible +for identifying individual metadata items, scheduling work items for them, +reacting to the outcomes appropriately, and reporting results to the system +administrator. +The second and third are in the kernel, which implements functions to check +and repair each type of online fsck work item. + ++------------------------------------------------------------------+ +| **Note**: | ++------------------------------------------------------------------+ +| For brevity, this document shortens the phrase "online fsck work | +| item" to "scrub item". | ++------------------------------------------------------------------+ + +Scrub item types are delineated in a manner consistent with the Unix design +philosophy, which is to say that each item should handle one aspect of a +metadata structure, and handle it well. + +Scope +----- + +In principle, online fsck should be able to check and to repair everything that +the offline fsck program can handle. +However, online fsck cannot be running 100% of the time, which means that +latent errors may creep in after a scrub completes. +If these errors cause the next mount to fail, offline fsck is the only +solution. +This limitation means that maintenance of the offline fsck tool will continue. +A second limitation of online fsck is that it must follow the same resource +sharing and lock acquisition rules as the regular filesystem. +This means that scrub cannot take *any* shortcuts to save time, because doing +so could lead to concurrency problems. +In other words, online fsck is not a complete replacement for offline fsck, and +a complete run of online fsck may take longer than online fsck. +However, both of these limitations are acceptable tradeoffs to satisfy the +different motivations of online fsck, which are to **minimize system downtime** +and to **increase predictability of operation**. + +.. _scrubphases: + +Phases of Work +-------------- + +The userspace driver program ``xfs_scrub`` splits the work of checking and +repairing an entire filesystem into seven phases. +Each phase concentrates on checking specific types of scrub items and depends +on the success of all previous phases. +The seven phases are as follows: + +1. Collect geometry information about the mounted filesystem and computer, + discover the online fsck capabilities of the kernel, and open the + underlying storage devices. + +2. Check allocation group metadata, all realtime volume metadata, and all quota + files. + Each metadata structure is scheduled as a separate scrub item. + If corruption is found in the inode header or inode btree and ``xfs_scrub`` + is permitted to perform repairs, then those scrub items are repaired to + prepare for phase 3. + Repairs are implemented by using the information in the scrub item to + resubmit the kernel scrub call with the repair flag enabled; this is + discussed in the next section. + Optimizations and all other repairs are deferred to phase 4. + +3. Check all metadata of every file in the filesystem. + Each metadata structure is also scheduled as a separate scrub item. + If repairs are needed and ``xfs_scrub`` is permitted to perform repairs, + and there were no problems detected during phase 2, then those scrub items + are repaired immediately. + Optimizations, deferred repairs, and unsuccessful repairs are deferred to + phase 4. + +4. All remaining repairs and scheduled optimizations are performed during this + phase, if the caller permits them. + Before starting repairs, the summary counters are checked and any necessary + repairs are performed so that subsequent repairs will not fail the resource + reservation step due to wildly incorrect summary counters. + Unsuccessful repairs are requeued as long as forward progress on repairs is + made somewhere in the filesystem. + Free space in the filesystem is trimmed at the end of phase 4 if the + filesystem is clean. + +5. By the start of this phase, all primary and secondary filesystem metadata + must be correct. + Summary counters such as the free space counts and quota resource counts + are checked and corrected. + Directory entry names and extended attribute names are checked for + suspicious entries such as control characters or confusing Unicode sequences + appearing in names. + +6. If the caller asks for a media scan, read all allocated and written data + file extents in the filesystem. + The ability to use hardware-assisted data file integrity checking is new + to online fsck; neither of the previous tools have this capability. + If media errors occur, they will be mapped to the owning files and reported. + +7. Re-check the summary counters and presents the caller with a summary of + space usage and file counts. + +This allocation of responsibilities will be :ref:`revisited <scrubcheck>` +later in this document. + +Steps for Each Scrub Item +------------------------- + +The kernel scrub code uses a three-step strategy for checking and repairing +the one aspect of a metadata object represented by a scrub item: + +1. The scrub item of interest is checked for corruptions; opportunities for + optimization; and for values that are directly controlled by the system + administrator but look suspicious. + If the item is not corrupt or does not need optimization, resource are + released and the positive scan results are returned to userspace. + If the item is corrupt or could be optimized but the caller does not permit + this, resources are released and the negative scan results are returned to + userspace. + Otherwise, the kernel moves on to the second step. + +2. The repair function is called to rebuild the data structure. + Repair functions generally choose rebuild a structure from other metadata + rather than try to salvage the existing structure. + If the repair fails, the scan results from the first step are returned to + userspace. + Otherwise, the kernel moves on to the third step. + +3. In the third step, the kernel runs the same checks over the new metadata + item to assess the efficacy of the repairs. + The results of the reassessment are returned to userspace. + +Classification of Metadata +-------------------------- + +Each type of metadata object (and therefore each type of scrub item) is +classified as follows: + +Primary Metadata +```````````````` + +Metadata structures in this category should be most familiar to filesystem +users either because they are directly created by the user or they index +objects created by the user +Most filesystem objects fall into this class: + +- Free space and reference count information + +- Inode records and indexes + +- Storage mapping information for file data + +- Directories + +- Extended attributes + +- Symbolic links + +- Quota limits + +Scrub obeys the same rules as regular filesystem accesses for resource and lock +acquisition. + +Primary metadata objects are the simplest for scrub to process. +The principal filesystem object (either an allocation group or an inode) that +owns the item being scrubbed is locked to guard against concurrent updates. +The check function examines every record associated with the type for obvious +errors and cross-references healthy records against other metadata to look for +inconsistencies. +Repairs for this class of scrub item are simple, since the repair function +starts by holding all the resources acquired in the previous step. +The repair function scans available metadata as needed to record all the +observations needed to complete the structure. +Next, it stages the observations in a new ondisk structure and commits it +atomically to complete the repair. +Finally, the storage from the old data structure are carefully reaped. + +Because ``xfs_scrub`` locks a primary object for the duration of the repair, +this is effectively an offline repair operation performed on a subset of the +filesystem. +This minimizes the complexity of the repair code because it is not necessary to +handle concurrent updates from other threads, nor is it necessary to access +any other part of the filesystem. +As a result, indexed structures can be rebuilt very quickly, and programs +trying to access the damaged structure will be blocked until repairs complete. +The only infrastructure needed by the repair code are the staging area for +observations and a means to write new structures to disk. +Despite these limitations, the advantage that online repair holds is clear: +targeted work on individual shards of the filesystem avoids total loss of +service. + +This mechanism is described in section 2.1 ("Off-Line Algorithm") of +V. Srinivasan and M. J. Carey, `"Performance of On-Line Index Construction +Algorithms" <https://minds.wisconsin.edu/bitstream/handle/1793/59524/TR1047.pdf>`_, +*Extending Database Technology*, pp. 293-309, 1992. + +Most primary metadata repair functions stage their intermediate results in an +in-memory array prior to formatting the new ondisk structure, which is very +similar to the list-based algorithm discussed in section 2.3 ("List-Based +Algorithms") of Srinivasan. +However, any data structure builder that maintains a resource lock for the +duration of the repair is *always* an offline algorithm. + +.. _secondary_metadata: + +Secondary Metadata +`````````````````` + +Metadata structures in this category reflect records found in primary metadata, +but are only needed for online fsck or for reorganization of the filesystem. + +Secondary metadata include: + +- Reverse mapping information + +- Directory parent pointers + +This class of metadata is difficult for scrub to process because scrub attaches +to the secondary object but needs to check primary metadata, which runs counter +to the usual order of resource acquisition. +Frequently, this means that full filesystems scans are necessary to rebuild the +metadata. +Check functions can be limited in scope to reduce runtime. +Repairs, however, require a full scan of primary metadata, which can take a +long time to complete. +Under these conditions, ``xfs_scrub`` cannot lock resources for the entire +duration of the repair. + +Instead, repair functions set up an in-memory staging structure to store +observations. +Depending on the requirements of the specific repair function, the staging +index will either have the same format as the ondisk structure or a design +specific to that repair function. +The next step is to release all locks and start the filesystem scan. +When the repair scanner needs to record an observation, the staging data are +locked long enough to apply the update. +While the filesystem scan is in progress, the repair function hooks the +filesystem so that it can apply pending filesystem updates to the staging +information. +Once the scan is done, the owning object is re-locked, the live data is used to +write a new ondisk structure, and the repairs are committed atomically. +The hooks are disabled and the staging staging area is freed. +Finally, the storage from the old data structure are carefully reaped. + +Introducing concurrency helps online repair avoid various locking problems, but +comes at a high cost to code complexity. +Live filesystem code has to be hooked so that the repair function can observe +updates in progress. +The staging area has to become a fully functional parallel structure so that +updates can be merged from the hooks. +Finally, the hook, the filesystem scan, and the inode locking model must be +sufficiently well integrated that a hook event can decide if a given update +should be applied to the staging structure. + +In theory, the scrub implementation could apply these same techniques for +primary metadata, but doing so would make it massively more complex and less +performant. +Programs attempting to access the damaged structures are not blocked from +operation, which may cause application failure or an unplanned filesystem +shutdown. + +Inspiration for the secondary metadata repair strategy was drawn from section +2.4 of Srinivasan above, and sections 2 ("NSF: Inded Build Without Side-File") +and 3.1.1 ("Duplicate Key Insert Problem") in C. Mohan, `"Algorithms for +Creating Indexes for Very Large Tables Without Quiescing Updates" +<https://dl.acm.org/doi/10.1145/130283.130337>`_, 1992. + +The sidecar index mentioned above bears some resemblance to the side file +method mentioned in Srinivasan and Mohan. +Their method consists of an index builder that extracts relevant record data to +build the new structure as quickly as possible; and an auxiliary structure that +captures all updates that would be committed to the index by other threads were +the new index already online. +After the index building scan finishes, the updates recorded in the side file +are applied to the new index. +To avoid conflicts between the index builder and other writer threads, the +builder maintains a publicly visible cursor that tracks the progress of the +scan through the record space. +To avoid duplication of work between the side file and the index builder, side +file updates are elided when the record ID for the update is greater than the +cursor position within the record ID space. + +To minimize changes to the rest of the codebase, XFS online repair keeps the +replacement index hidden until it's completely ready to go. +In other words, there is no attempt to expose the keyspace of the new index +while repair is running. +The complexity of such an approach would be very high and perhaps more +appropriate to building *new* indices. + +**Future Work Question**: Can the full scan and live update code used to +facilitate a repair also be used to implement a comprehensive check? + +*Answer*: In theory, yes. Check would be much stronger if each scrub function +employed these live scans to build a shadow copy of the metadata and then +compared the shadow records to the ondisk records. +However, doing that is a fair amount more work than what the checking functions +do now. +The live scans and hooks were developed much later. +That in turn increases the runtime of those scrub functions. + +Summary Information +``````````````````` + +Metadata structures in this last category summarize the contents of primary +metadata records. +These are often used to speed up resource usage queries, and are many times +smaller than the primary metadata which they represent. + +Examples of summary information include: + +- Summary counts of free space and inodes + +- File link counts from directories + +- Quota resource usage counts + +Check and repair require full filesystem scans, but resource and lock +acquisition follow the same paths as regular filesystem accesses. + +The superblock summary counters have special requirements due to the underlying +implementation of the incore counters, and will be treated separately. +Check and repair of the other types of summary counters (quota resource counts +and file link counts) employ the same filesystem scanning and hooking +techniques as outlined above, but because the underlying data are sets of +integer counters, the staging data need not be a fully functional mirror of the +ondisk structure. + +Inspiration for quota and file link count repair strategies were drawn from +sections 2.12 ("Online Index Operations") through 2.14 ("Incremental View +Maintenance") of G. Graefe, `"Concurrent Queries and Updates in Summary Views +and Their Indexes" +<http://www.odbms.org/wp-content/uploads/2014/06/Increment-locks.pdf>`_, 2011. + +Since quotas are non-negative integer counts of resource usage, online +quotacheck can use the incremental view deltas described in section 2.14 to +track pending changes to the block and inode usage counts in each transaction, +and commit those changes to a dquot side file when the transaction commits. +Delta tracking is necessary for dquots because the index builder scans inodes, +whereas the data structure being rebuilt is an index of dquots. +Link count checking combines the view deltas and commit step into one because +it sets attributes of the objects being scanned instead of writing them to a +separate data structure. +Each online fsck function will be discussed as case studies later in this +document. + +Risk Management +--------------- + +During the development of online fsck, several risk factors were identified +that may make the feature unsuitable for certain distributors and users. +Steps can be taken to mitigate or eliminate those risks, though at a cost to +functionality. + +- **Decreased performance**: Adding metadata indices to the filesystem + increases the time cost of persisting changes to disk, and the reverse space + mapping and directory parent pointers are no exception. + System administrators who require the maximum performance can disable the + reverse mapping features at format time, though this choice dramatically + reduces the ability of online fsck to find inconsistencies and repair them. + +- **Incorrect repairs**: As with all software, there might be defects in the + software that result in incorrect repairs being written to the filesystem. + Systematic fuzz testing (detailed in the next section) is employed by the + authors to find bugs early, but it might not catch everything. + The kernel build system provides Kconfig options (``CONFIG_XFS_ONLINE_SCRUB`` + and ``CONFIG_XFS_ONLINE_REPAIR``) to enable distributors to choose not to + accept this risk. + The xfsprogs build system has a configure option (``--enable-scrub=no``) that + disables building of the ``xfs_scrub`` binary, though this is not a risk + mitigation if the kernel functionality remains enabled. + +- **Inability to repair**: Sometimes, a filesystem is too badly damaged to be + repairable. + If the keyspaces of several metadata indices overlap in some manner but a + coherent narrative cannot be formed from records collected, then the repair + fails. + To reduce the chance that a repair will fail with a dirty transaction and + render the filesystem unusable, the online repair functions have been + designed to stage and validate all new records before committing the new + structure. + +- **Misbehavior**: Online fsck requires many privileges -- raw IO to block + devices, opening files by handle, ignoring Unix discretionary access control, + and the ability to perform administrative changes. + Running this automatically in the background scares people, so the systemd + background service is configured to run with only the privileges required. + Obviously, this cannot address certain problems like the kernel crashing or + deadlocking, but it should be sufficient to prevent the scrub process from + escaping and reconfiguring the system. + The cron job does not have this protection. + +- **Fuzz Kiddiez**: There are many people now who seem to think that running + automated fuzz testing of ondisk artifacts to find mischievous behavior and + spraying exploit code onto the public mailing list for instant zero-day + disclosure is somehow of some social benefit. + In the view of this author, the benefit is realized only when the fuzz + operators help to **fix** the flaws, but this opinion apparently is not + widely shared among security "researchers". + The XFS maintainers' continuing ability to manage these events presents an + ongoing risk to the stability of the development process. + Automated testing should front-load some of the risk while the feature is + considered EXPERIMENTAL. + +Many of these risks are inherent to software programming. +Despite this, it is hoped that this new functionality will prove useful in +reducing unexpected downtime. + +3. Testing Plan +=============== + +As stated before, fsck tools have three main goals: + +1. Detect inconsistencies in the metadata; + +2. Eliminate those inconsistencies; and + +3. Minimize further loss of data. + +Demonstrations of correct operation are necessary to build users' confidence +that the software behaves within expectations. +Unfortunately, it was not really feasible to perform regular exhaustive testing +of every aspect of a fsck tool until the introduction of low-cost virtual +machines with high-IOPS storage. +With ample hardware availability in mind, the testing strategy for the online +fsck project involves differential analysis against the existing fsck tools and +systematic testing of every attribute of every type of metadata object. +Testing can be split into four major categories, as discussed below. + +Integrated Testing with fstests +------------------------------- + +The primary goal of any free software QA effort is to make testing as +inexpensive and widespread as possible to maximize the scaling advantages of +community. +In other words, testing should maximize the breadth of filesystem configuration +scenarios and hardware setups. +This improves code quality by enabling the authors of online fsck to find and +fix bugs early, and helps developers of new features to find integration +issues earlier in their development effort. + +The Linux filesystem community shares a common QA testing suite, +`fstests <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/>`_, for +functional and regression testing. +Even before development work began on online fsck, fstests (when run on XFS) +would run both the ``xfs_check`` and ``xfs_repair -n`` commands on the test and +scratch filesystems between each test. +This provides a level of assurance that the kernel and the fsck tools stay in +alignment about what constitutes consistent metadata. +During development of the online checking code, fstests was modified to run +``xfs_scrub -n`` between each test to ensure that the new checking code +produces the same results as the two existing fsck tools. + +To start development of online repair, fstests was modified to run +``xfs_repair`` to rebuild the filesystem's metadata indices between tests. +This ensures that offline repair does not crash, leave a corrupt filesystem +after it exists, or trigger complaints from the online check. +This also established a baseline for what can and cannot be repaired offline. +To complete the first phase of development of online repair, fstests was +modified to be able to run ``xfs_scrub`` in a "force rebuild" mode. +This enables a comparison of the effectiveness of online repair as compared to +the existing offline repair tools. + +General Fuzz Testing of Metadata Blocks +--------------------------------------- + +XFS benefits greatly from having a very robust debugging tool, ``xfs_db``. + +Before development of online fsck even began, a set of fstests were created +to test the rather common fault that entire metadata blocks get corrupted. +This required the creation of fstests library code that can create a filesystem +containing every possible type of metadata object. +Next, individual test cases were created to create a test filesystem, identify +a single block of a specific type of metadata object, trash it with the +existing ``blocktrash`` command in ``xfs_db``, and test the reaction of a +particular metadata validation strategy. + +This earlier test suite enabled XFS developers to test the ability of the +in-kernel validation functions and the ability of the offline fsck tool to +detect and eliminate the inconsistent metadata. +This part of the test suite was extended to cover online fsck in exactly the +same manner. + +In other words, for a given fstests filesystem configuration: + +* For each metadata object existing on the filesystem: + + * Write garbage to it + + * Test the reactions of: + + 1. The kernel verifiers to stop obviously bad metadata + 2. Offline repair (``xfs_repair``) to detect and fix + 3. Online repair (``xfs_scrub``) to detect and fix + +Targeted Fuzz Testing of Metadata Records +----------------------------------------- + +The testing plan for online fsck includes extending the existing fs testing +infrastructure to provide a much more powerful facility: targeted fuzz testing +of every metadata field of every metadata object in the filesystem. +``xfs_db`` can modify every field of every metadata structure in every +block in the filesystem to simulate the effects of memory corruption and +software bugs. +Given that fstests already contains the ability to create a filesystem +containing every metadata format known to the filesystem, ``xfs_db`` can be +used to perform exhaustive fuzz testing! + +For a given fstests filesystem configuration: + +* For each metadata object existing on the filesystem... + + * For each record inside that metadata object... + + * For each field inside that record... + + * For each conceivable type of transformation that can be applied to a bit field... + + 1. Clear all bits + 2. Set all bits + 3. Toggle the most significant bit + 4. Toggle the middle bit + 5. Toggle the least significant bit + 6. Add a small quantity + 7. Subtract a small quantity + 8. Randomize the contents + + * ...test the reactions of: + + 1. The kernel verifiers to stop obviously bad metadata + 2. Offline checking (``xfs_repair -n``) + 3. Offline repair (``xfs_repair``) + 4. Online checking (``xfs_scrub -n``) + 5. Online repair (``xfs_scrub``) + 6. Both repair tools (``xfs_scrub`` and then ``xfs_repair`` if online repair doesn't succeed) + +This is quite the combinatoric explosion! + +Fortunately, having this much test coverage makes it easy for XFS developers to +check the responses of XFS' fsck tools. +Since the introduction of the fuzz testing framework, these tests have been +used to discover incorrect repair code and missing functionality for entire +classes of metadata objects in ``xfs_repair``. +The enhanced testing was used to finalize the deprecation of ``xfs_check`` by +confirming that ``xfs_repair`` could detect at least as many corruptions as +the older tool. + +These tests have been very valuable for ``xfs_scrub`` in the same ways -- they +allow the online fsck developers to compare online fsck against offline fsck, +and they enable XFS developers to find deficiencies in the code base. + +Proposed patchsets include +`general fuzzer improvements +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzzer-improvements>`_, +`fuzzing baselines +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzz-baseline>`_, +and `improvements in fuzz testing comprehensiveness +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=more-fuzz-testing>`_. + +Stress Testing +-------------- + +A unique requirement to online fsck is the ability to operate on a filesystem +concurrently with regular workloads. +Although it is of course impossible to run ``xfs_scrub`` with *zero* observable +impact on the running system, the online repair code should never introduce +inconsistencies into the filesystem metadata, and regular workloads should +never notice resource starvation. +To verify that these conditions are being met, fstests has been enhanced in +the following ways: + +* For each scrub item type, create a test to exercise checking that item type + while running ``fsstress``. +* For each scrub item type, create a test to exercise repairing that item type + while running ``fsstress``. +* Race ``fsstress`` and ``xfs_scrub -n`` to ensure that checking the whole + filesystem doesn't cause problems. +* Race ``fsstress`` and ``xfs_scrub`` in force-rebuild mode to ensure that + force-repairing the whole filesystem doesn't cause problems. +* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while + freezing and thawing the filesystem. +* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while + remounting the filesystem read-only and read-write. +* The same, but running ``fsx`` instead of ``fsstress``. (Not done yet?) + +Success is defined by the ability to run all of these tests without observing +any unexpected filesystem shutdowns due to corrupted metadata, kernel hang +check warnings, or any other sort of mischief. + +Proposed patchsets include `general stress testing +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=race-scrub-and-mount-state-changes>`_ +and the `evolution of existing per-function stress testing +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=refactor-scrub-stress>`_. + +4. User Interface +================= + +The primary user of online fsck is the system administrator, just like offline +repair. +Online fsck presents two modes of operation to administrators: +A foreground CLI process for online fsck on demand, and a background service +that performs autonomous checking and repair. + +Checking on Demand +------------------ + +For administrators who want the absolute freshest information about the +metadata in a filesystem, ``xfs_scrub`` can be run as a foreground process on +a command line. +The program checks every piece of metadata in the filesystem while the +administrator waits for the results to be reported, just like the existing +``xfs_repair`` tool. +Both tools share a ``-n`` option to perform a read-only scan, and a ``-v`` +option to increase the verbosity of the information reported. + +A new feature of ``xfs_scrub`` is the ``-x`` option, which employs the error +correction capabilities of the hardware to check data file contents. +The media scan is not enabled by default because it may dramatically increase +program runtime and consume a lot of bandwidth on older storage hardware. + +The output of a foreground invocation is captured in the system log. + +The ``xfs_scrub_all`` program walks the list of mounted filesystems and +initiates ``xfs_scrub`` for each of them in parallel. +It serializes scans for any filesystems that resolve to the same top level +kernel block device to prevent resource overconsumption. + +Background Service +------------------ + +To reduce the workload of system administrators, the ``xfs_scrub`` package +provides a suite of `systemd <https://systemd.io/>`_ timers and services that +run online fsck automatically on weekends by default. +The background service configures scrub to run with as little privilege as +possible, the lowest CPU and IO priority, and in a CPU-constrained single +threaded mode. +This can be tuned by the systemd administrator at any time to suit the latency +and throughput requirements of customer workloads. + +The output of the background service is also captured in the system log. +If desired, reports of failures (either due to inconsistencies or mere runtime +errors) can be emailed automatically by setting the ``EMAIL_ADDR`` environment +variable in the following service files: + +* ``xfs_scrub_fail@.service`` +* ``xfs_scrub_media_fail@.service`` +* ``xfs_scrub_all_fail.service`` + +The decision to enable the background scan is left to the system administrator. +This can be done by enabling either of the following services: + +* ``xfs_scrub_all.timer`` on systemd systems +* ``xfs_scrub_all.cron`` on non-systemd systems + +This automatic weekly scan is configured out of the box to perform an +additional media scan of all file data once per month. +This is less foolproof than, say, storing file data block checksums, but much +more performant if application software provides its own integrity checking, +redundancy can be provided elsewhere above the filesystem, or the storage +device's integrity guarantees are deemed sufficient. + +The systemd unit file definitions have been subjected to a security audit +(as of systemd 249) to ensure that the xfs_scrub processes have as little +access to the rest of the system as possible. +This was performed via ``systemd-analyze security``, after which privileges +were restricted to the minimum required, sandboxing was set up to the maximal +extent possible with sandboxing and system call filtering; and access to the +filesystem tree was restricted to the minimum needed to start the program and +access the filesystem being scanned. +The service definition files restrict CPU usage to 80% of one CPU core, and +apply as nice of a priority to IO and CPU scheduling as possible. +This measure was taken to minimize delays in the rest of the filesystem. +No such hardening has been performed for the cron job. + +Proposed patchset: +`Enabling the xfs_scrub background service +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_. + +Health Reporting +---------------- + +XFS caches a summary of each filesystem's health status in memory. +The information is updated whenever ``xfs_scrub`` is run, or whenever +inconsistencies are detected in the filesystem metadata during regular +operations. +System administrators should use the ``health`` command of ``xfs_spaceman`` to +download this information into a human-readable format. +If problems have been observed, the administrator can schedule a reduced +service window to run the online repair tool to correct the problem. +Failing that, the administrator can decide to schedule a maintenance window to +run the traditional offline repair tool to correct the problem. + +**Future Work Question**: Should the health reporting integrate with the new +inotify fs error notification system? +Would it be helpful for sysadmins to have a daemon to listen for corruption +notifications and initiate a repair? + +*Answer*: These questions remain unanswered, but should be a part of the +conversation with early adopters and potential downstream users of XFS. + +Proposed patchsets include +`wiring up health reports to correction returns +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=corruption-health-reports>`_ +and +`preservation of sickness info during memory reclaim +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=indirect-health-reporting>`_. + +5. Kernel Algorithms and Data Structures +======================================== + +This section discusses the key algorithms and data structures of the kernel +code that provide the ability to check and repair metadata while the system +is running. +The first chapters in this section reveal the pieces that provide the +foundation for checking metadata. +The remainder of this section presents the mechanisms through which XFS +regenerates itself. + +Self Describing Metadata +------------------------ + +Starting with XFS version 5 in 2012, XFS updated the format of nearly every +ondisk block header to record a magic number, a checksum, a universally +"unique" identifier (UUID), an owner code, the ondisk address of the block, +and a log sequence number. +When loading a block buffer from disk, the magic number, UUID, owner, and +ondisk address confirm that the retrieved block matches the specific owner of +the current filesystem, and that the information contained in the block is +supposed to be found at the ondisk address. +The first three components enable checking tools to disregard alleged metadata +that doesn't belong to the filesystem, and the fourth component enables the +filesystem to detect lost writes. + +Whenever a file system operation modifies a block, the change is submitted +to the log as part of a transaction. +The log then processes these transactions marking them done once they are +safely persisted to storage. +The logging code maintains the checksum and the log sequence number of the last +transactional update. +Checksums are useful for detecting torn writes and other discrepancies that can +be introduced between the computer and its storage devices. +Sequence number tracking enables log recovery to avoid applying out of date +log updates to the filesystem. + +These two features improve overall runtime resiliency by providing a means for +the filesystem to detect obvious corruption when reading metadata blocks from +disk, but these buffer verifiers cannot provide any consistency checking +between metadata structures. + +For more information, please see the documentation for +Documentation/filesystems/xfs/xfs-self-describing-metadata.rst + +Reverse Mapping +--------------- + +The original design of XFS (circa 1993) is an improvement upon 1980s Unix +filesystem design. +In those days, storage density was expensive, CPU time was scarce, and +excessive seek time could kill performance. +For performance reasons, filesystem authors were reluctant to add redundancy to +the filesystem, even at the cost of data integrity. +Filesystems designers in the early 21st century choose different strategies to +increase internal redundancy -- either storing nearly identical copies of +metadata, or more space-efficient encoding techniques. + +For XFS, a different redundancy strategy was chosen to modernize the design: +a secondary space usage index that maps allocated disk extents back to their +owners. +By adding a new index, the filesystem retains most of its ability to scale +well to heavily threaded workloads involving large datasets, since the primary +file metadata (the directory tree, the file block map, and the allocation +groups) remain unchanged. +Like any system that improves redundancy, the reverse-mapping feature increases +overhead costs for space mapping activities. +However, it has two critical advantages: first, the reverse index is key to +enabling online fsck and other requested functionality such as free space +defragmentation, better media failure reporting, and filesystem shrinking. +Second, the different ondisk storage format of the reverse mapping btree +defeats device-level deduplication because the filesystem requires real +redundancy. + ++--------------------------------------------------------------------------+ +| **Sidebar**: | ++--------------------------------------------------------------------------+ +| A criticism of adding the secondary index is that it does nothing to | +| improve the robustness of user data storage itself. | +| This is a valid point, but adding a new index for file data block | +| checksums increases write amplification by turning data overwrites into | +| copy-writes, which age the filesystem prematurely. | +| In keeping with thirty years of precedent, users who want file data | +| integrity can supply as powerful a solution as they require. | +| As for metadata, the complexity of adding a new secondary index of space | +| usage is much less than adding volume management and storage device | +| mirroring to XFS itself. | +| Perfection of RAID and volume management are best left to existing | +| layers in the kernel. | ++--------------------------------------------------------------------------+ + +The information captured in a reverse space mapping record is as follows: + +.. code-block:: c + + struct xfs_rmap_irec { + xfs_agblock_t rm_startblock; /* extent start block */ + xfs_extlen_t rm_blockcount; /* extent length */ + uint64_t rm_owner; /* extent owner */ + uint64_t rm_offset; /* offset within the owner */ + unsigned int rm_flags; /* state flags */ + }; + +The first two fields capture the location and size of the physical space, +in units of filesystem blocks. +The owner field tells scrub which metadata structure or file inode have been +assigned this space. +For space allocated to files, the offset field tells scrub where the space was +mapped within the file fork. +Finally, the flags field provides extra information about the space usage -- +is this an attribute fork extent? A file mapping btree extent? Or an +unwritten data extent? + +Online filesystem checking judges the consistency of each primary metadata +record by comparing its information against all other space indices. +The reverse mapping index plays a key role in the consistency checking process +because it contains a centralized alternate copy of all space allocation +information. +Program runtime and ease of resource acquisition are the only real limits to +what online checking can consult. +For example, a file data extent mapping can be checked against: + +* The absence of an entry in the free space information. +* The absence of an entry in the inode index. +* The absence of an entry in the reference count data if the file is not + marked as having shared extents. +* The correspondence of an entry in the reverse mapping information. + +There are several observations to make about reverse mapping indices: + +1. Reverse mappings can provide a positive affirmation of correctness if any of + the above primary metadata are in doubt. + The checking code for most primary metadata follows a path similar to the + one outlined above. + +2. Proving the consistency of secondary metadata with the primary metadata is + difficult because that requires a full scan of all primary space metadata, + which is very time intensive. + For example, checking a reverse mapping record for a file extent mapping + btree block requires locking the file and searching the entire btree to + confirm the block. + Instead, scrub relies on rigorous cross-referencing during the primary space + mapping structure checks. + +3. Consistency scans must use non-blocking lock acquisition primitives if the + required locking order is not the same order used by regular filesystem + operations. + For example, if the filesystem normally takes a file ILOCK before taking + the AGF buffer lock but scrub wants to take a file ILOCK while holding + an AGF buffer lock, scrub cannot block on that second acquisition. + This means that forward progress during this part of a scan of the reverse + mapping data cannot be guaranteed if system load is heavy. + +In summary, reverse mappings play a key role in reconstruction of primary +metadata. +The details of how these records are staged, written to disk, and committed +into the filesystem are covered in subsequent sections. + +Checking and Cross-Referencing +------------------------------ + +The first step of checking a metadata structure is to examine every record +contained within the structure and its relationship with the rest of the +system. +XFS contains multiple layers of checking to try to prevent inconsistent +metadata from wreaking havoc on the system. +Each of these layers contributes information that helps the kernel to make +three decisions about the health of a metadata structure: + +- Is a part of this structure obviously corrupt (``XFS_SCRUB_OFLAG_CORRUPT``) ? +- Is this structure inconsistent with the rest of the system + (``XFS_SCRUB_OFLAG_XCORRUPT``) ? +- Is there so much damage around the filesystem that cross-referencing is not + possible (``XFS_SCRUB_OFLAG_XFAIL``) ? +- Can the structure be optimized to improve performance or reduce the size of + metadata (``XFS_SCRUB_OFLAG_PREEN``) ? +- Does the structure contain data that is not inconsistent but deserves review + by the system administrator (``XFS_SCRUB_OFLAG_WARNING``) ? + +The following sections describe how the metadata scrubbing process works. + +Metadata Buffer Verification +```````````````````````````` + +The lowest layer of metadata protection in XFS are the metadata verifiers built +into the buffer cache. +These functions perform inexpensive internal consistency checking of the block +itself, and answer these questions: + +- Does the block belong to this filesystem? + +- Does the block belong to the structure that asked for the read? + This assumes that metadata blocks only have one owner, which is always true + in XFS. + +- Is the type of data stored in the block within a reasonable range of what + scrub is expecting? + +- Does the physical location of the block match the location it was read from? + +- Does the block checksum match the data? + +The scope of the protections here are very limited -- verifiers can only +establish that the filesystem code is reasonably free of gross corruption bugs +and that the storage system is reasonably competent at retrieval. +Corruption problems observed at runtime cause the generation of health reports, +failed system calls, and in the extreme case, filesystem shutdowns if the +corrupt metadata force the cancellation of a dirty transaction. + +Every online fsck scrubbing function is expected to read every ondisk metadata +block of a structure in the course of checking the structure. +Corruption problems observed during a check are immediately reported to +userspace as corruption; during a cross-reference, they are reported as a +failure to cross-reference once the full examination is complete. +Reads satisfied by a buffer already in cache (and hence already verified) +bypass these checks. + +Internal Consistency Checks +``````````````````````````` + +After the buffer cache, the next level of metadata protection is the internal +record verification code built into the filesystem. +These checks are split between the buffer verifiers, the in-filesystem users of +the buffer cache, and the scrub code itself, depending on the amount of higher +level context required. +The scope of checking is still internal to the block. +These higher level checking functions answer these questions: + +- Does the type of data stored in the block match what scrub is expecting? + +- Does the block belong to the owning structure that asked for the read? + +- If the block contains records, do the records fit within the block? + +- If the block tracks internal free space information, is it consistent with + the record areas? + +- Are the records contained inside the block free of obvious corruptions? + +Record checks in this category are more rigorous and more time-intensive. +For example, block pointers and inumbers are checked to ensure that they point +within the dynamically allocated parts of an allocation group and within +the filesystem. +Names are checked for invalid characters, and flags are checked for invalid +combinations. +Other record attributes are checked for sensible values. +Btree records spanning an interval of the btree keyspace are checked for +correct order and lack of mergeability (except for file fork mappings). +For performance reasons, regular code may skip some of these checks unless +debugging is enabled or a write is about to occur. +Scrub functions, of course, must check all possible problems. + +Validation of Userspace-Controlled Record Attributes +```````````````````````````````````````````````````` + +Various pieces of filesystem metadata are directly controlled by userspace. +Because of this nature, validation work cannot be more precise than checking +that a value is within the possible range. +These fields include: + +- Superblock fields controlled by mount options +- Filesystem labels +- File timestamps +- File permissions +- File size +- File flags +- Names present in directory entries, extended attribute keys, and filesystem + labels +- Extended attribute key namespaces +- Extended attribute values +- File data block contents +- Quota limits +- Quota timer expiration (if resource usage exceeds the soft limit) + +Cross-Referencing Space Metadata +```````````````````````````````` + +After internal block checks, the next higher level of checking is +cross-referencing records between metadata structures. +For regular runtime code, the cost of these checks is considered to be +prohibitively expensive, but as scrub is dedicated to rooting out +inconsistencies, it must pursue all avenues of inquiry. +The exact set of cross-referencing is highly dependent on the context of the +data structure being checked. + +The XFS btree code has keyspace scanning functions that online fsck uses to +cross reference one structure with another. +Specifically, scrub can scan the key space of an index to determine if that +keyspace is fully, sparsely, or not at all mapped to records. +For the reverse mapping btree, it is possible to mask parts of the key for the +purposes of performing a keyspace scan so that scrub can decide if the rmap +btree contains records mapping a certain extent of physical space without the +sparsenses of the rest of the rmap keyspace getting in the way. + +Btree blocks undergo the following checks before cross-referencing: + +- Does the type of data stored in the block match what scrub is expecting? + +- Does the block belong to the owning structure that asked for the read? + +- Do the records fit within the block? + +- Are the records contained inside the block free of obvious corruptions? + +- Are the name hashes in the correct order? + +- Do node pointers within the btree point to valid block addresses for the type + of btree? + +- Do child pointers point towards the leaves? + +- Do sibling pointers point across the same level? + +- For each node block record, does the record key accurate reflect the contents + of the child block? + +Space allocation records are cross-referenced as follows: + +1. Any space mentioned by any metadata structure are cross-referenced as + follows: + + - Does the reverse mapping index list only the appropriate owner as the + owner of each block? + + - Are none of the blocks claimed as free space? + + - If these aren't file data blocks, are none of the blocks claimed as space + shared by different owners? + +2. Btree blocks are cross-referenced as follows: + + - Everything in class 1 above. + + - If there's a parent node block, do the keys listed for this block match the + keyspace of this block? + + - Do the sibling pointers point to valid blocks? Of the same level? + + - Do the child pointers point to valid blocks? Of the next level down? + +3. Free space btree records are cross-referenced as follows: + + - Everything in class 1 and 2 above. + + - Does the reverse mapping index list no owners of this space? + + - Is this space not claimed by the inode index for inodes? + + - Is it not mentioned by the reference count index? + + - Is there a matching record in the other free space btree? + +4. Inode btree records are cross-referenced as follows: + + - Everything in class 1 and 2 above. + + - Is there a matching record in free inode btree? + + - Do cleared bits in the holemask correspond with inode clusters? + + - Do set bits in the freemask correspond with inode records with zero link + count? + +5. Inode records are cross-referenced as follows: + + - Everything in class 1. + + - Do all the fields that summarize information about the file forks actually + match those forks? + + - Does each inode with zero link count correspond to a record in the free + inode btree? + +6. File fork space mapping records are cross-referenced as follows: + + - Everything in class 1 and 2 above. + + - Is this space not mentioned by the inode btrees? + + - If this is a CoW fork mapping, does it correspond to a CoW entry in the + reference count btree? + +7. Reference count records are cross-referenced as follows: + + - Everything in class 1 and 2 above. + + - Within the space subkeyspace of the rmap btree (that is to say, all + records mapped to a particular space extent and ignoring the owner info), + are there the same number of reverse mapping records for each block as the + reference count record claims? + +Proposed patchsets are the series to find gaps in +`refcount btree +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-refcount-gaps>`_, +`inode btree +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-inobt-gaps>`_, and +`rmap btree +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-rmapbt-gaps>`_ records; +to find +`mergeable records +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-mergeable-records>`_; +and to +`improve cross referencing with rmap +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-strengthen-rmap-checking>`_ +before starting a repair. + +Checking Extended Attributes +```````````````````````````` + +Extended attributes implement a key-value store that enable fragments of data +to be attached to any file. +Both the kernel and userspace can access the keys and values, subject to +namespace and privilege restrictions. +Most typically these fragments are metadata about the file -- origins, security +contexts, user-supplied labels, indexing information, etc. + +Names can be as long as 255 bytes and can exist in several different +namespaces. +Values can be as large as 64KB. +A file's extended attributes are stored in blocks mapped by the attr fork. +The mappings point to leaf blocks, remote value blocks, or dabtree blocks. +Block 0 in the attribute fork is always the top of the structure, but otherwise +each of the three types of blocks can be found at any offset in the attr fork. +Leaf blocks contain attribute key records that point to the name and the value. +Names are always stored elsewhere in the same leaf block. +Values that are less than 3/4 the size of a filesystem block are also stored +elsewhere in the same leaf block. +Remote value blocks contain values that are too large to fit inside a leaf. +If the leaf information exceeds a single filesystem block, a dabtree (also +rooted at block 0) is created to map hashes of the attribute names to leaf +blocks in the attr fork. + +Checking an extended attribute structure is not so straightforward due to the +lack of separation between attr blocks and index blocks. +Scrub must read each block mapped by the attr fork and ignore the non-leaf +blocks: + +1. Walk the dabtree in the attr fork (if present) to ensure that there are no + irregularities in the blocks or dabtree mappings that do not point to + attr leaf blocks. + +2. Walk the blocks of the attr fork looking for leaf blocks. + For each entry inside a leaf: + + a. Validate that the name does not contain invalid characters. + + b. Read the attr value. + This performs a named lookup of the attr name to ensure the correctness + of the dabtree. + If the value is stored in a remote block, this also validates the + integrity of the remote value block. + +Checking and Cross-Referencing Directories +`````````````````````````````````````````` + +The filesystem directory tree is a directed acylic graph structure, with files +constituting the nodes, and directory entries (dirents) constituting the edges. +Directories are a special type of file containing a set of mappings from a +255-byte sequence (name) to an inumber. +These are called directory entries, or dirents for short. +Each directory file must have exactly one directory pointing to the file. +A root directory points to itself. +Directory entries point to files of any type. +Each non-directory file may have multiple directories point to it. + +In XFS, directories are implemented as a file containing up to three 32GB +partitions. +The first partition contains directory entry data blocks. +Each data block contains variable-sized records associating a user-provided +name with an inumber and, optionally, a file type. +If the directory entry data grows beyond one block, the second partition (which +exists as post-EOF extents) is populated with a block containing free space +information and an index that maps hashes of the dirent names to directory data +blocks in the first partition. +This makes directory name lookups very fast. +If this second partition grows beyond one block, the third partition is +populated with a linear array of free space information for faster +expansions. +If the free space has been separated and the second partition grows again +beyond one block, then a dabtree is used to map hashes of dirent names to +directory data blocks. + +Checking a directory is pretty straightforward: + +1. Walk the dabtree in the second partition (if present) to ensure that there + are no irregularities in the blocks or dabtree mappings that do not point to + dirent blocks. + +2. Walk the blocks of the first partition looking for directory entries. + Each dirent is checked as follows: + + a. Does the name contain no invalid characters? + + b. Does the inumber correspond to an actual, allocated inode? + + c. Does the child inode have a nonzero link count? + + d. If a file type is included in the dirent, does it match the type of the + inode? + + e. If the child is a subdirectory, does the child's dotdot pointer point + back to the parent? + + f. If the directory has a second partition, perform a named lookup of the + dirent name to ensure the correctness of the dabtree. + +3. Walk the free space list in the third partition (if present) to ensure that + the free spaces it describes are really unused. + +Checking operations involving :ref:`parents <dirparent>` and +:ref:`file link counts <nlinks>` are discussed in more detail in later +sections. + +Checking Directory/Attribute Btrees +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +As stated in previous sections, the directory/attribute btree (dabtree) index +maps user-provided names to improve lookup times by avoiding linear scans. +Internally, it maps a 32-bit hash of the name to a block offset within the +appropriate file fork. + +The internal structure of a dabtree closely resembles the btrees that record +fixed-size metadata records -- each dabtree block contains a magic number, a +checksum, sibling pointers, a UUID, a tree level, and a log sequence number. +The format of leaf and node records are the same -- each entry points to the +next level down in the hierarchy, with dabtree node records pointing to dabtree +leaf blocks, and dabtree leaf records pointing to non-dabtree blocks elsewhere +in the fork. + +Checking and cross-referencing the dabtree is very similar to what is done for +space btrees: + +- Does the type of data stored in the block match what scrub is expecting? + +- Does the block belong to the owning structure that asked for the read? + +- Do the records fit within the block? + +- Are the records contained inside the block free of obvious corruptions? + +- Are the name hashes in the correct order? + +- Do node pointers within the dabtree point to valid fork offsets for dabtree + blocks? + +- Do leaf pointers within the dabtree point to valid fork offsets for directory + or attr leaf blocks? + +- Do child pointers point towards the leaves? + +- Do sibling pointers point across the same level? + +- For each dabtree node record, does the record key accurate reflect the + contents of the child dabtree block? + +- For each dabtree leaf record, does the record key accurate reflect the + contents of the directory or attr block? + +Cross-Referencing Summary Counters +`````````````````````````````````` + +XFS maintains three classes of summary counters: available resources, quota +resource usage, and file link counts. + +In theory, the amount of available resources (data blocks, inodes, realtime +extents) can be found by walking the entire filesystem. +This would make for very slow reporting, so a transactional filesystem can +maintain summaries of this information in the superblock. +Cross-referencing these values against the filesystem metadata should be a +simple matter of walking the free space and inode metadata in each AG and the +realtime bitmap, but there are complications that will be discussed in +:ref:`more detail <fscounters>` later. + +:ref:`Quota usage <quotacheck>` and :ref:`file link count <nlinks>` +checking are sufficiently complicated to warrant separate sections. + +Post-Repair Reverification +`````````````````````````` + +After performing a repair, the checking code is run a second time to validate +the new structure, and the results of the health assessment are recorded +internally and returned to the calling process. +This step is critical for enabling system administrator to monitor the status +of the filesystem and the progress of any repairs. +For developers, it is a useful means to judge the efficacy of error detection +and correction in the online and offline checking tools. + +Eventual Consistency vs. Online Fsck +------------------------------------ + +Complex operations can make modifications to multiple per-AG data structures +with a chain of transactions. +These chains, once committed to the log, are restarted during log recovery if +the system crashes while processing the chain. +Because the AG header buffers are unlocked between transactions within a chain, +online checking must coordinate with chained operations that are in progress to +avoid incorrectly detecting inconsistencies due to pending chains. +Furthermore, online repair must not run when operations are pending because +the metadata are temporarily inconsistent with each other, and rebuilding is +not possible. + +Only online fsck has this requirement of total consistency of AG metadata, and +should be relatively rare as compared to filesystem change operations. +Online fsck coordinates with transaction chains as follows: + +* For each AG, maintain a count of intent items targeting that AG. + The count should be bumped whenever a new item is added to the chain. + The count should be dropped when the filesystem has locked the AG header + buffers and finished the work. + +* When online fsck wants to examine an AG, it should lock the AG header + buffers to quiesce all transaction chains that want to modify that AG. + If the count is zero, proceed with the checking operation. + If it is nonzero, cycle the buffer locks to allow the chain to make forward + progress. + +This may lead to online fsck taking a long time to complete, but regular +filesystem updates take precedence over background checking activity. +Details about the discovery of this situation are presented in the +:ref:`next section <chain_coordination>`, and details about the solution +are presented :ref:`after that<intent_drains>`. + +.. _chain_coordination: + +Discovery of the Problem +```````````````````````` + +Midway through the development of online scrubbing, the fsstress tests +uncovered a misinteraction between online fsck and compound transaction chains +created by other writer threads that resulted in false reports of metadata +inconsistency. +The root cause of these reports is the eventual consistency model introduced by +the expansion of deferred work items and compound transaction chains when +reverse mapping and reflink were introduced. + +Originally, transaction chains were added to XFS to avoid deadlocks when +unmapping space from files. +Deadlock avoidance rules require that AGs only be locked in increasing order, +which makes it impossible (say) to use a single transaction to free a space +extent in AG 7 and then try to free a now superfluous block mapping btree block +in AG 3. +To avoid these kinds of deadlocks, XFS creates Extent Freeing Intent (EFI) log +items to commit to freeing some space in one transaction while deferring the +actual metadata updates to a fresh transaction. +The transaction sequence looks like this: + +1. The first transaction contains a physical update to the file's block mapping + structures to remove the mapping from the btree blocks. + It then attaches to the in-memory transaction an action item to schedule + deferred freeing of space. + Concretely, each transaction maintains a list of ``struct + xfs_defer_pending`` objects, each of which maintains a list of ``struct + xfs_extent_free_item`` objects. + Returning to the example above, the action item tracks the freeing of both + the unmapped space from AG 7 and the block mapping btree (BMBT) block from + AG 3. + Deferred frees recorded in this manner are committed in the log by creating + an EFI log item from the ``struct xfs_extent_free_item`` object and + attaching the log item to the transaction. + When the log is persisted to disk, the EFI item is written into the ondisk + transaction record. + EFIs can list up to 16 extents to free, all sorted in AG order. + +2. The second transaction contains a physical update to the free space btrees + of AG 3 to release the former BMBT block and a second physical update to the + free space btrees of AG 7 to release the unmapped file space. + Observe that the physical updates are resequenced in the correct order + when possible. + Attached to the transaction is a an extent free done (EFD) log item. + The EFD contains a pointer to the EFI logged in transaction #1 so that log + recovery can tell if the EFI needs to be replayed. + +If the system goes down after transaction #1 is written back to the filesystem +but before #2 is committed, a scan of the filesystem metadata would show +inconsistent filesystem metadata because there would not appear to be any owner +of the unmapped space. +Happily, log recovery corrects this inconsistency for us -- when recovery finds +an intent log item but does not find a corresponding intent done item, it will +reconstruct the incore state of the intent item and finish it. +In the example above, the log must replay both frees described in the recovered +EFI to complete the recovery phase. + +There are subtleties to XFS' transaction chaining strategy to consider: + +* Log items must be added to a transaction in the correct order to prevent + conflicts with principal objects that are not held by the transaction. + In other words, all per-AG metadata updates for an unmapped block must be + completed before the last update to free the extent, and extents should not + be reallocated until that last update commits to the log. + +* AG header buffers are released between each transaction in a chain. + This means that other threads can observe an AG in an intermediate state, + but as long as the first subtlety is handled, this should not affect the + correctness of filesystem operations. + +* Unmounting the filesystem flushes all pending work to disk, which means that + offline fsck never sees the temporary inconsistencies caused by deferred + work item processing. + +In this manner, XFS employs a form of eventual consistency to avoid deadlocks +and increase parallelism. + +During the design phase of the reverse mapping and reflink features, it was +decided that it was impractical to cram all the reverse mapping updates for a +single filesystem change into a single transaction because a single file +mapping operation can explode into many small updates: + +* The block mapping update itself +* A reverse mapping update for the block mapping update +* Fixing the freelist +* A reverse mapping update for the freelist fix + +* A shape change to the block mapping btree +* A reverse mapping update for the btree update +* Fixing the freelist (again) +* A reverse mapping update for the freelist fix + +* An update to the reference counting information +* A reverse mapping update for the refcount update +* Fixing the freelist (a third time) +* A reverse mapping update for the freelist fix + +* Freeing any space that was unmapped and not owned by any other file +* Fixing the freelist (a fourth time) +* A reverse mapping update for the freelist fix + +* Freeing the space used by the block mapping btree +* Fixing the freelist (a fifth time) +* A reverse mapping update for the freelist fix + +Free list fixups are not usually needed more than once per AG per transaction +chain, but it is theoretically possible if space is very tight. +For copy-on-write updates this is even worse, because this must be done once to +remove the space from a staging area and again to map it into the file! + +To deal with this explosion in a calm manner, XFS expands its use of deferred +work items to cover most reverse mapping updates and all refcount updates. +This reduces the worst case size of transaction reservations by breaking the +work into a long chain of small updates, which increases the degree of eventual +consistency in the system. +Again, this generally isn't a problem because XFS orders its deferred work +items carefully to avoid resource reuse conflicts between unsuspecting threads. + +However, online fsck changes the rules -- remember that although physical +updates to per-AG structures are coordinated by locking the buffers for AG +headers, buffer locks are dropped between transactions. +Once scrub acquires resources and takes locks for a data structure, it must do +all the validation work without releasing the lock. +If the main lock for a space btree is an AG header buffer lock, scrub may have +interrupted another thread that is midway through finishing a chain. +For example, if a thread performing a copy-on-write has completed a reverse +mapping update but not the corresponding refcount update, the two AG btrees +will appear inconsistent to scrub and an observation of corruption will be +recorded. This observation will not be correct. +If a repair is attempted in this state, the results will be catastrophic! + +Several other solutions to this problem were evaluated upon discovery of this +flaw and rejected: + +1. Add a higher level lock to allocation groups and require writer threads to + acquire the higher level lock in AG order before making any changes. + This would be very difficult to implement in practice because it is + difficult to determine which locks need to be obtained, and in what order, + without simulating the entire operation. + Performing a dry run of a file operation to discover necessary locks would + make the filesystem very slow. + +2. Make the deferred work coordinator code aware of consecutive intent items + targeting the same AG and have it hold the AG header buffers locked across + the transaction roll between updates. + This would introduce a lot of complexity into the coordinator since it is + only loosely coupled with the actual deferred work items. + It would also fail to solve the problem because deferred work items can + generate new deferred subtasks, but all subtasks must be complete before + work can start on a new sibling task. + +3. Teach online fsck to walk all transactions waiting for whichever lock(s) + protect the data structure being scrubbed to look for pending operations. + The checking and repair operations must factor these pending operations into + the evaluations being performed. + This solution is a nonstarter because it is *extremely* invasive to the main + filesystem. + +.. _intent_drains: + +Intent Drains +````````````` + +Online fsck uses an atomic intent item counter and lock cycling to coordinate +with transaction chains. +There are two key properties to the drain mechanism. +First, the counter is incremented when a deferred work item is *queued* to a +transaction, and it is decremented after the associated intent done log item is +*committed* to another transaction. +The second property is that deferred work can be added to a transaction without +holding an AG header lock, but per-AG work items cannot be marked done without +locking that AG header buffer to log the physical updates and the intent done +log item. +The first property enables scrub to yield to running transaction chains, which +is an explicit deprioritization of online fsck to benefit file operations. +The second property of the drain is key to the correct coordination of scrub, +since scrub will always be able to decide if a conflict is possible. + +For regular filesystem code, the drain works as follows: + +1. Call the appropriate subsystem function to add a deferred work item to a + transaction. + +2. The function calls ``xfs_defer_drain_bump`` to increase the counter. + +3. When the deferred item manager wants to finish the deferred work item, it + calls ``->finish_item`` to complete it. + +4. The ``->finish_item`` implementation logs some changes and calls + ``xfs_defer_drain_drop`` to decrease the sloppy counter and wake up any threads + waiting on the drain. + +5. The subtransaction commits, which unlocks the resource associated with the + intent item. + +For scrub, the drain works as follows: + +1. Lock the resource(s) associated with the metadata being scrubbed. + For example, a scan of the refcount btree would lock the AGI and AGF header + buffers. + +2. If the counter is zero (``xfs_defer_drain_busy`` returns false), there are no + chains in progress and the operation may proceed. + +3. Otherwise, release the resources grabbed in step 1. + +4. Wait for the intent counter to reach zero (``xfs_defer_drain_intents``), then go + back to step 1 unless a signal has been caught. + +To avoid polling in step 4, the drain provides a waitqueue for scrub threads to +be woken up whenever the intent count drops to zero. + +The proposed patchset is the +`scrub intent drain series +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-drain-intents>`_. + +.. _jump_labels: + +Static Keys (aka Jump Label Patching) +````````````````````````````````````` + +Online fsck for XFS separates the regular filesystem from the checking and +repair code as much as possible. +However, there are a few parts of online fsck (such as the intent drains, and +later, live update hooks) where it is useful for the online fsck code to know +what's going on in the rest of the filesystem. +Since it is not expected that online fsck will be constantly running in the +background, it is very important to minimize the runtime overhead imposed by +these hooks when online fsck is compiled into the kernel but not actively +running on behalf of userspace. +Taking locks in the hot path of a writer thread to access a data structure only +to find that no further action is necessary is expensive -- on the author's +computer, this have an overhead of 40-50ns per access. +Fortunately, the kernel supports dynamic code patching, which enables XFS to +replace a static branch to hook code with ``nop`` sleds when online fsck isn't +running. +This sled has an overhead of however long it takes the instruction decoder to +skip past the sled, which seems to be on the order of less than 1ns and +does not access memory outside of instruction fetching. + +When online fsck enables the static key, the sled is replaced with an +unconditional branch to call the hook code. +The switchover is quite expensive (~22000ns) but is paid entirely by the +program that invoked online fsck, and can be amortized if multiple threads +enter online fsck at the same time, or if multiple filesystems are being +checked at the same time. +Changing the branch direction requires taking the CPU hotplug lock, and since +CPU initialization requires memory allocation, online fsck must be careful not +to change a static key while holding any locks or resources that could be +accessed in the memory reclaim paths. +To minimize contention on the CPU hotplug lock, care should be taken not to +enable or disable static keys unnecessarily. + +Because static keys are intended to minimize hook overhead for regular +filesystem operations when xfs_scrub is not running, the intended usage +patterns are as follows: + +- The hooked part of XFS should declare a static-scoped static key that + defaults to false. + The ``DEFINE_STATIC_KEY_FALSE`` macro takes care of this. + The static key itself should be declared as a ``static`` variable. + +- When deciding to invoke code that's only used by scrub, the regular + filesystem should call the ``static_branch_unlikely`` predicate to avoid the + scrub-only hook code if the static key is not enabled. + +- The regular filesystem should export helper functions that call + ``static_branch_inc`` to enable and ``static_branch_dec`` to disable the + static key. + Wrapper functions make it easy to compile out the relevant code if the kernel + distributor turns off online fsck at build time. + +- Scrub functions wanting to turn on scrub-only XFS functionality should call + the ``xchk_fsgates_enable`` from the setup function to enable a specific + hook. + This must be done before obtaining any resources that are used by memory + reclaim. + Callers had better be sure they really need the functionality gated by the + static key; the ``TRY_HARDER`` flag is useful here. + +Online scrub has resource acquisition helpers (e.g. ``xchk_perag_lock``) to +handle locking AGI and AGF buffers for all scrubber functions. +If it detects a conflict between scrub and the running transactions, it will +try to wait for intents to complete. +If the caller of the helper has not enabled the static key, the helper will +return -EDEADLOCK, which should result in the scrub being restarted with the +``TRY_HARDER`` flag set. +The scrub setup function should detect that flag, enable the static key, and +try the scrub again. +Scrub teardown disables all static keys obtained by ``xchk_fsgates_enable``. + +For more information, please see the kernel documentation of +Documentation/staging/static-keys.rst. + +.. _xfile: + +Pageable Kernel Memory +---------------------- + +Some online checking functions work by scanning the filesystem to build a +shadow copy of an ondisk metadata structure in memory and comparing the two +copies. +For online repair to rebuild a metadata structure, it must compute the record +set that will be stored in the new structure before it can persist that new +structure to disk. +Ideally, repairs complete with a single atomic commit that introduces +a new data structure. +To meet these goals, the kernel needs to collect a large amount of information +in a place that doesn't require the correct operation of the filesystem. + +Kernel memory isn't suitable because: + +* Allocating a contiguous region of memory to create a C array is very + difficult, especially on 32-bit systems. + +* Linked lists of records introduce double pointer overhead which is very high + and eliminate the possibility of indexed lookups. + +* Kernel memory is pinned, which can drive the system into OOM conditions. + +* The system might not have sufficient memory to stage all the information. + +At any given time, online fsck does not need to keep the entire record set in +memory, which means that individual records can be paged out if necessary. +Continued development of online fsck demonstrated that the ability to perform +indexed data storage would also be very useful. +Fortunately, the Linux kernel already has a facility for byte-addressable and +pageable storage: tmpfs. +In-kernel graphics drivers (most notably i915) take advantage of tmpfs files +to store intermediate data that doesn't need to be in memory at all times, so +that usage precedent is already established. +Hence, the ``xfile`` was born! + ++--------------------------------------------------------------------------+ +| **Historical Sidebar**: | ++--------------------------------------------------------------------------+ +| The first edition of online repair inserted records into a new btree as | +| it found them, which failed because filesystem could shut down with a | +| built data structure, which would be live after recovery finished. | +| | +| The second edition solved the half-rebuilt structure problem by storing | +| everything in memory, but frequently ran the system out of memory. | +| | +| The third edition solved the OOM problem by using linked lists, but the | +| memory overhead of the list pointers was extreme. | ++--------------------------------------------------------------------------+ + +xfile Access Models +``````````````````` + +A survey of the intended uses of xfiles suggested these use cases: + +1. Arrays of fixed-sized records (space management btrees, directory and + extended attribute entries) + +2. Sparse arrays of fixed-sized records (quotas and link counts) + +3. Large binary objects (BLOBs) of variable sizes (directory and extended + attribute names and values) + +4. Staging btrees in memory (reverse mapping btrees) + +5. Arbitrary contents (realtime space management) + +To support the first four use cases, high level data structures wrap the xfile +to share functionality between online fsck functions. +The rest of this section discusses the interfaces that the xfile presents to +four of those five higher level data structures. +The fifth use case is discussed in the :ref:`realtime summary <rtsummary>` case +study. + +The most general storage interface supported by the xfile enables the reading +and writing of arbitrary quantities of data at arbitrary offsets in the xfile. +This capability is provided by ``xfile_pread`` and ``xfile_pwrite`` functions, +which behave similarly to their userspace counterparts. +XFS is very record-based, which suggests that the ability to load and store +complete records is important. +To support these cases, a pair of ``xfile_obj_load`` and ``xfile_obj_store`` +functions are provided to read and persist objects into an xfile. +They are internally the same as pread and pwrite, except that they treat any +error as an out of memory error. +For online repair, squashing error conditions in this manner is an acceptable +behavior because the only reaction is to abort the operation back to userspace. +All five xfile usecases can be serviced by these four functions. + +However, no discussion of file access idioms is complete without answering the +question, "But what about mmap?" +It is convenient to access storage directly with pointers, just like userspace +code does with regular memory. +Online fsck must not drive the system into OOM conditions, which means that +xfiles must be responsive to memory reclamation. +tmpfs can only push a pagecache folio to the swap cache if the folio is neither +pinned nor locked, which means the xfile must not pin too many folios. + +Short term direct access to xfile contents is done by locking the pagecache +folio and mapping it into kernel address space. +Programmatic access (e.g. pread and pwrite) uses this mechanism. +Folio locks are not supposed to be held for long periods of time, so long +term direct access to xfile contents is done by bumping the folio refcount, +mapping it into kernel address space, and dropping the folio lock. +These long term users *must* be responsive to memory reclaim by hooking into +the shrinker infrastructure to know when to release folios. + +The ``xfile_get_page`` and ``xfile_put_page`` functions are provided to +retrieve the (locked) folio that backs part of an xfile and to release it. +The only code to use these folio lease functions are the xfarray +:ref:`sorting<xfarray_sort>` algorithms and the :ref:`in-memory +btrees<xfbtree>`. + +xfile Access Coordination +````````````````````````` + +For security reasons, xfiles must be owned privately by the kernel. +They are marked ``S_PRIVATE`` to prevent interference from the security system, +must never be mapped into process file descriptor tables, and their pages must +never be mapped into userspace processes. + +To avoid locking recursion issues with the VFS, all accesses to the shmfs file +are performed by manipulating the page cache directly. +xfile writers call the ``->write_begin`` and ``->write_end`` functions of the +xfile's address space to grab writable pages, copy the caller's buffer into the +page, and release the pages. +xfile readers call ``shmem_read_mapping_page_gfp`` to grab pages directly +before copying the contents into the caller's buffer. +In other words, xfiles ignore the VFS read and write code paths to avoid +having to create a dummy ``struct kiocb`` and to avoid taking inode and +freeze locks. +tmpfs cannot be frozen, and xfiles must not be exposed to userspace. + +If an xfile is shared between threads to stage repairs, the caller must provide +its own locks to coordinate access. +For example, if a scrub function stores scan results in an xfile and needs +other threads to provide updates to the scanned data, the scrub function must +provide a lock for all threads to share. + +.. _xfarray: + +Arrays of Fixed-Sized Records +````````````````````````````` + +In XFS, each type of indexed space metadata (free space, inodes, reference +counts, file fork space, and reverse mappings) consists of a set of fixed-size +records indexed with a classic B+ tree. +Directories have a set of fixed-size dirent records that point to the names, +and extended attributes have a set of fixed-size attribute keys that point to +names and values. +Quota counters and file link counters index records with numbers. +During a repair, scrub needs to stage new records during the gathering step and +retrieve them during the btree building step. + +Although this requirement can be satisfied by calling the read and write +methods of the xfile directly, it is simpler for callers for there to be a +higher level abstraction to take care of computing array offsets, to provide +iterator functions, and to deal with sparse records and sorting. +The ``xfarray`` abstraction presents a linear array for fixed-size records atop +the byte-accessible xfile. + +.. _xfarray_access_patterns: + +Array Access Patterns +^^^^^^^^^^^^^^^^^^^^^ + +Array access patterns in online fsck tend to fall into three categories. +Iteration of records is assumed to be necessary for all cases and will be +covered in the next section. + +The first type of caller handles records that are indexed by position. +Gaps may exist between records, and a record may be updated multiple times +during the collection step. +In other words, these callers want a sparse linearly addressed table file. +The typical use case are quota records or file link count records. +Access to array elements is performed programmatically via ``xfarray_load`` and +``xfarray_store`` functions, which wrap the similarly-named xfile functions to +provide loading and storing of array elements at arbitrary array indices. +Gaps are defined to be null records, and null records are defined to be a +sequence of all zero bytes. +Null records are detected by calling ``xfarray_element_is_null``. +They are created either by calling ``xfarray_unset`` to null out an existing +record or by never storing anything to an array index. + +The second type of caller handles records that are not indexed by position +and do not require multiple updates to a record. +The typical use case here is rebuilding space btrees and key/value btrees. +These callers can add records to the array without caring about array indices +via the ``xfarray_append`` function, which stores a record at the end of the +array. +For callers that require records to be presentable in a specific order (e.g. +rebuilding btree data), the ``xfarray_sort`` function can arrange the sorted +records; this function will be covered later. + +The third type of caller is a bag, which is useful for counting records. +The typical use case here is constructing space extent reference counts from +reverse mapping information. +Records can be put in the bag in any order, they can be removed from the bag +at any time, and uniqueness of records is left to callers. +The ``xfarray_store_anywhere`` function is used to insert a record in any +null record slot in the bag; and the ``xfarray_unset`` function removes a +record from the bag. + +The proposed patchset is the +`big in-memory array +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=big-array>`_. + +Iterating Array Elements +^^^^^^^^^^^^^^^^^^^^^^^^ + +Most users of the xfarray require the ability to iterate the records stored in +the array. +Callers can probe every possible array index with the following: + +.. code-block:: c + + xfarray_idx_t i; + foreach_xfarray_idx(array, i) { + xfarray_load(array, i, &rec); + + /* do something with rec */ + } + +All users of this idiom must be prepared to handle null records or must already +know that there aren't any. + +For xfarray users that want to iterate a sparse array, the ``xfarray_iter`` +function ignores indices in the xfarray that have never been written to by +calling ``xfile_seek_data`` (which internally uses ``SEEK_DATA``) to skip areas +of the array that are not populated with memory pages. +Once it finds a page, it will skip the zeroed areas of the page. + +.. code-block:: c + + xfarray_idx_t i = XFARRAY_CURSOR_INIT; + while ((ret = xfarray_iter(array, &i, &rec)) == 1) { + /* do something with rec */ + } + +.. _xfarray_sort: + +Sorting Array Elements +^^^^^^^^^^^^^^^^^^^^^^ + +During the fourth demonstration of online repair, a community reviewer remarked +that for performance reasons, online repair ought to load batches of records +into btree record blocks instead of inserting records into a new btree one at a +time. +The btree insertion code in XFS is responsible for maintaining correct ordering +of the records, so naturally the xfarray must also support sorting the record +set prior to bulk loading. + +Case Study: Sorting xfarrays +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The sorting algorithm used in the xfarray is actually a combination of adaptive +quicksort and a heapsort subalgorithm in the spirit of +`Sedgewick <https://algs4.cs.princeton.edu/23quicksort/>`_ and +`pdqsort <https://github.com/orlp/pdqsort>`_, with customizations for the Linux +kernel. +To sort records in a reasonably short amount of time, ``xfarray`` takes +advantage of the binary subpartitioning offered by quicksort, but it also uses +heapsort to hedge against performance collapse if the chosen quicksort pivots +are poor. +Both algorithms are (in general) O(n * lg(n)), but there is a wide performance +gulf between the two implementations. + +The Linux kernel already contains a reasonably fast implementation of heapsort. +It only operates on regular C arrays, which limits the scope of its usefulness. +There are two key places where the xfarray uses it: + +* Sorting any record subset backed by a single xfile page. + +* Loading a small number of xfarray records from potentially disparate parts + of the xfarray into a memory buffer, and sorting the buffer. + +In other words, ``xfarray`` uses heapsort to constrain the nested recursion of +quicksort, thereby mitigating quicksort's worst runtime behavior. + +Choosing a quicksort pivot is a tricky business. +A good pivot splits the set to sort in half, leading to the divide and conquer +behavior that is crucial to O(n * lg(n)) performance. +A poor pivot barely splits the subset at all, leading to O(n\ :sup:`2`) +runtime. +The xfarray sort routine tries to avoid picking a bad pivot by sampling nine +records into a memory buffer and using the kernel heapsort to identify the +median of the nine. + +Most modern quicksort implementations employ Tukey's "ninther" to select a +pivot from a classic C array. +Typical ninther implementations pick three unique triads of records, sort each +of the triads, and then sort the middle value of each triad to determine the +ninther value. +As stated previously, however, xfile accesses are not entirely cheap. +It turned out to be much more performant to read the nine elements into a +memory buffer, run the kernel's in-memory heapsort on the buffer, and choose +the 4th element of that buffer as the pivot. +Tukey's ninthers are described in J. W. Tukey, `The ninther, a technique for +low-effort robust (resistant) location in large samples`, in *Contributions to +Survey Sampling and Applied Statistics*, edited by H. David, (Academic Press, +1978), pp. 251–257. + +The partitioning of quicksort is fairly textbook -- rearrange the record +subset around the pivot, then set up the current and next stack frames to +sort with the larger and the smaller halves of the pivot, respectively. +This keeps the stack space requirements to log2(record count). + +As a final performance optimization, the hi and lo scanning phase of quicksort +keeps examined xfile pages mapped in the kernel for as long as possible to +reduce map/unmap cycles. +Surprisingly, this reduces overall sort runtime by nearly half again after +accounting for the application of heapsort directly onto xfile pages. + +.. _xfblob: + +Blob Storage +```````````` + +Extended attributes and directories add an additional requirement for staging +records: arbitrary byte sequences of finite length. +Each directory entry record needs to store entry name, +and each extended attribute needs to store both the attribute name and value. +The names, keys, and values can consume a large amount of memory, so the +``xfblob`` abstraction was created to simplify management of these blobs +atop an xfile. + +Blob arrays provide ``xfblob_load`` and ``xfblob_store`` functions to retrieve +and persist objects. +The store function returns a magic cookie for every object that it persists. +Later, callers provide this cookie to the ``xblob_load`` to recall the object. +The ``xfblob_free`` function frees a specific blob, and the ``xfblob_truncate`` +function frees them all because compaction is not needed. + +The details of repairing directories and extended attributes will be discussed +in a subsequent section about atomic extent swapping. +However, it should be noted that these repair functions only use blob storage +to cache a small number of entries before adding them to a temporary ondisk +file, which is why compaction is not required. + +The proposed patchset is at the start of the +`extended attribute repair +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_ series. + +.. _xfbtree: + +In-Memory B+Trees +````````````````` + +The chapter about :ref:`secondary metadata<secondary_metadata>` mentioned that +checking and repairing of secondary metadata commonly requires coordination +between a live metadata scan of the filesystem and writer threads that are +updating that metadata. +Keeping the scan data up to date requires requires the ability to propagate +metadata updates from the filesystem into the data being collected by the scan. +This *can* be done by appending concurrent updates into a separate log file and +applying them before writing the new metadata to disk, but this leads to +unbounded memory consumption if the rest of the system is very busy. +Another option is to skip the side-log and commit live updates from the +filesystem directly into the scan data, which trades more overhead for a lower +maximum memory requirement. +In both cases, the data structure holding the scan results must support indexed +access to perform well. + +Given that indexed lookups of scan data is required for both strategies, online +fsck employs the second strategy of committing live updates directly into +scan data. +Because xfarrays are not indexed and do not enforce record ordering, they +are not suitable for this task. +Conveniently, however, XFS has a library to create and maintain ordered reverse +mapping records: the existing rmap btree code! +If only there was a means to create one in memory. + +Recall that the :ref:`xfile <xfile>` abstraction represents memory pages as a +regular file, which means that the kernel can create byte or block addressable +virtual address spaces at will. +The XFS buffer cache specializes in abstracting IO to block-oriented address +spaces, which means that adaptation of the buffer cache to interface with +xfiles enables reuse of the entire btree library. +Btrees built atop an xfile are collectively known as ``xfbtrees``. +The next few sections describe how they actually work. + +The proposed patchset is the +`in-memory btree +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=in-memory-btrees>`_ +series. + +Using xfiles as a Buffer Cache Target +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Two modifications are necessary to support xfiles as a buffer cache target. +The first is to make it possible for the ``struct xfs_buftarg`` structure to +host the ``struct xfs_buf`` rhashtable, because normally those are held by a +per-AG structure. +The second change is to modify the buffer ``ioapply`` function to "read" cached +pages from the xfile and "write" cached pages back to the xfile. +Multiple access to individual buffers is controlled by the ``xfs_buf`` lock, +since the xfile does not provide any locking on its own. +With this adaptation in place, users of the xfile-backed buffer cache use +exactly the same APIs as users of the disk-backed buffer cache. +The separation between xfile and buffer cache implies higher memory usage since +they do not share pages, but this property could some day enable transactional +updates to an in-memory btree. +Today, however, it simply eliminates the need for new code. + +Space Management with an xfbtree +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Space management for an xfile is very simple -- each btree block is one memory +page in size. +These blocks use the same header format as an on-disk btree, but the in-memory +block verifiers ignore the checksums, assuming that xfile memory is no more +corruption-prone than regular DRAM. +Reusing existing code here is more important than absolute memory efficiency. + +The very first block of an xfile backing an xfbtree contains a header block. +The header describes the owner, height, and the block number of the root +xfbtree block. + +To allocate a btree block, use ``xfile_seek_data`` to find a gap in the file. +If there are no gaps, create one by extending the length of the xfile. +Preallocate space for the block with ``xfile_prealloc``, and hand back the +location. +To free an xfbtree block, use ``xfile_discard`` (which internally uses +``FALLOC_FL_PUNCH_HOLE``) to remove the memory page from the xfile. + +Populating an xfbtree +^^^^^^^^^^^^^^^^^^^^^ + +An online fsck function that wants to create an xfbtree should proceed as +follows: + +1. Call ``xfile_create`` to create an xfile. + +2. Call ``xfs_alloc_memory_buftarg`` to create a buffer cache target structure + pointing to the xfile. + +3. Pass the buffer cache target, buffer ops, and other information to + ``xfbtree_create`` to write an initial tree header and root block to the + xfile. + Each btree type should define a wrapper that passes necessary arguments to + the creation function. + For example, rmap btrees define ``xfs_rmapbt_mem_create`` to take care of + all the necessary details for callers. + A ``struct xfbtree`` object will be returned. + +4. Pass the xfbtree object to the btree cursor creation function for the + btree type. + Following the example above, ``xfs_rmapbt_mem_cursor`` takes care of this + for callers. + +5. Pass the btree cursor to the regular btree functions to make queries against + and to update the in-memory btree. + For example, a btree cursor for an rmap xfbtree can be passed to the + ``xfs_rmap_*`` functions just like any other btree cursor. + See the :ref:`next section<xfbtree_commit>` for information on dealing with + xfbtree updates that are logged to a transaction. + +6. When finished, delete the btree cursor, destroy the xfbtree object, free the + buffer target, and the destroy the xfile to release all resources. + +.. _xfbtree_commit: + +Committing Logged xfbtree Buffers +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Although it is a clever hack to reuse the rmap btree code to handle the staging +structure, the ephemeral nature of the in-memory btree block storage presents +some challenges of its own. +The XFS transaction manager must not commit buffer log items for buffers backed +by an xfile because the log format does not understand updates for devices +other than the data device. +An ephemeral xfbtree probably will not exist by the time the AIL checkpoints +log transactions back into the filesystem, and certainly won't exist during +log recovery. +For these reasons, any code updating an xfbtree in transaction context must +remove the buffer log items from the transaction and write the updates into the +backing xfile before committing or cancelling the transaction. + +The ``xfbtree_trans_commit`` and ``xfbtree_trans_cancel`` functions implement +this functionality as follows: + +1. Find each buffer log item whose buffer targets the xfile. + +2. Record the dirty/ordered status of the log item. + +3. Detach the log item from the buffer. + +4. Queue the buffer to a special delwri list. + +5. Clear the transaction dirty flag if the only dirty log items were the ones + that were detached in step 3. + +6. Submit the delwri list to commit the changes to the xfile, if the updates + are being committed. + +After removing xfile logged buffers from the transaction in this manner, the +transaction can be committed or cancelled. + +Bulk Loading of Ondisk B+Trees +------------------------------ + +As mentioned previously, early iterations of online repair built new btree +structures by creating a new btree and adding observations individually. +Loading a btree one record at a time had a slight advantage of not requiring +the incore records to be sorted prior to commit, but was very slow and leaked +blocks if the system went down during a repair. +Loading records one at a time also meant that repair could not control the +loading factor of the blocks in the new btree. + +Fortunately, the venerable ``xfs_repair`` tool had a more efficient means for +rebuilding a btree index from a collection of records -- bulk btree loading. +This was implemented rather inefficiently code-wise, since ``xfs_repair`` +had separate copy-pasted implementations for each btree type. + +To prepare for online fsck, each of the four bulk loaders were studied, notes +were taken, and the four were refactored into a single generic btree bulk +loading mechanism. +Those notes in turn have been refreshed and are presented below. + +Geometry Computation +```````````````````` + +The zeroth step of bulk loading is to assemble the entire record set that will +be stored in the new btree, and sort the records. +Next, call ``xfs_btree_bload_compute_geometry`` to compute the shape of the +btree from the record set, the type of btree, and any load factor preferences. +This information is required for resource reservation. + +First, the geometry computation computes the minimum and maximum records that +will fit in a leaf block from the size of a btree block and the size of the +block header. +Roughly speaking, the maximum number of records is:: + + maxrecs = (block_size - header_size) / record_size + +The XFS design specifies that btree blocks should be merged when possible, +which means the minimum number of records is half of maxrecs:: + + minrecs = maxrecs / 2 + +The next variable to determine is the desired loading factor. +This must be at least minrecs and no more than maxrecs. +Choosing minrecs is undesirable because it wastes half the block. +Choosing maxrecs is also undesirable because adding a single record to each +newly rebuilt leaf block will cause a tree split, which causes a noticeable +drop in performance immediately afterwards. +The default loading factor was chosen to be 75% of maxrecs, which provides a +reasonably compact structure without any immediate split penalties:: + + default_load_factor = (maxrecs + minrecs) / 2 + +If space is tight, the loading factor will be set to maxrecs to try to avoid +running out of space:: + + leaf_load_factor = enough space ? default_load_factor : maxrecs + +Load factor is computed for btree node blocks using the combined size of the +btree key and pointer as the record size:: + + maxrecs = (block_size - header_size) / (key_size + ptr_size) + minrecs = maxrecs / 2 + node_load_factor = enough space ? default_load_factor : maxrecs + +Once that's done, the number of leaf blocks required to store the record set +can be computed as:: + + leaf_blocks = ceil(record_count / leaf_load_factor) + +The number of node blocks needed to point to the next level down in the tree +is computed as:: + + n_blocks = (n == 0 ? leaf_blocks : node_blocks[n]) + node_blocks[n + 1] = ceil(n_blocks / node_load_factor) + +The entire computation is performed recursively until the current level only +needs one block. +The resulting geometry is as follows: + +- For AG-rooted btrees, this level is the root level, so the height of the new + tree is ``level + 1`` and the space needed is the summation of the number of + blocks on each level. + +- For inode-rooted btrees where the records in the top level do not fit in the + inode fork area, the height is ``level + 2``, the space needed is the + summation of the number of blocks on each level, and the inode fork points to + the root block. + +- For inode-rooted btrees where the records in the top level can be stored in + the inode fork area, then the root block can be stored in the inode, the + height is ``level + 1``, and the space needed is one less than the summation + of the number of blocks on each level. + This only becomes relevant when non-bmap btrees gain the ability to root in + an inode, which is a future patchset and only included here for completeness. + +.. _newbt: + +Reserving New B+Tree Blocks +``````````````````````````` + +Once repair knows the number of blocks needed for the new btree, it allocates +those blocks using the free space information. +Each reserved extent is tracked separately by the btree builder state data. +To improve crash resilience, the reservation code also logs an Extent Freeing +Intent (EFI) item in the same transaction as each space allocation and attaches +its in-memory ``struct xfs_extent_free_item`` object to the space reservation. +If the system goes down, log recovery will use the unfinished EFIs to free the +unused space, the free space, leaving the filesystem unchanged. + +Each time the btree builder claims a block for the btree from a reserved +extent, it updates the in-memory reservation to reflect the claimed space. +Block reservation tries to allocate as much contiguous space as possible to +reduce the number of EFIs in play. + +While repair is writing these new btree blocks, the EFIs created for the space +reservations pin the tail of the ondisk log. +It's possible that other parts of the system will remain busy and push the head +of the log towards the pinned tail. +To avoid livelocking the filesystem, the EFIs must not pin the tail of the log +for too long. +To alleviate this problem, the dynamic relogging capability of the deferred ops +mechanism is reused here to commit a transaction at the log head containing an +EFD for the old EFI and new EFI at the head. +This enables the log to release the old EFI to keep the log moving forwards. + +EFIs have a role to play during the commit and reaping phases; please see the +next section and the section about :ref:`reaping<reaping>` for more details. + +Proposed patchsets are the +`bitmap rework +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-bitmap-rework>`_ +and the +`preparation for bulk loading btrees +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loading>`_. + + +Writing the New Tree +```````````````````` + +This part is pretty simple -- the btree builder (``xfs_btree_bulkload``) claims +a block from the reserved list, writes the new btree block header, fills the +rest of the block with records, and adds the new leaf block to a list of +written blocks:: + + ┌────┐ + │leaf│ + │RRR │ + └────┘ + +Sibling pointers are set every time a new block is added to the level:: + + ┌────┐ ┌────┐ ┌────┐ ┌────┐ + │leaf│→│leaf│→│leaf│→│leaf│ + │RRR │←│RRR │←│RRR │←│RRR │ + └────┘ └────┘ └────┘ └────┘ + +When it finishes writing the record leaf blocks, it moves on to the node +blocks +To fill a node block, it walks each block in the next level down in the tree +to compute the relevant keys and write them into the parent node:: + + ┌────┐ ┌────┐ + │node│──────→│node│ + │PP │←──────│PP │ + └────┘ └────┘ + ↙ ↘ ↙ ↘ + ┌────┐ ┌────┐ ┌────┐ ┌────┐ + │leaf│→│leaf│→│leaf│→│leaf│ + │RRR │←│RRR │←│RRR │←│RRR │ + └────┘ └────┘ └────┘ └────┘ + +When it reaches the root level, it is ready to commit the new btree!:: + + ┌─────────┐ + │ root │ + │ PP │ + └─────────┘ + ↙ ↘ + ┌────┐ ┌────┐ + │node│──────→│node│ + │PP │←──────│PP │ + └────┘ └────┘ + ↙ ↘ ↙ ↘ + ┌────┐ ┌────┐ ┌────┐ ┌────┐ + │leaf│→│leaf│→│leaf│→│leaf│ + │RRR │←│RRR │←│RRR │←│RRR │ + └────┘ └────┘ └────┘ └────┘ + +The first step to commit the new btree is to persist the btree blocks to disk +synchronously. +This is a little complicated because a new btree block could have been freed +in the recent past, so the builder must use ``xfs_buf_delwri_queue_here`` to +remove the (stale) buffer from the AIL list before it can write the new blocks +to disk. +Blocks are queued for IO using a delwri list and written in one large batch +with ``xfs_buf_delwri_submit``. + +Once the new blocks have been persisted to disk, control returns to the +individual repair function that called the bulk loader. +The repair function must log the location of the new root in a transaction, +clean up the space reservations that were made for the new btree, and reap the +old metadata blocks: + +1. Commit the location of the new btree root. + +2. For each incore reservation: + + a. Log Extent Freeing Done (EFD) items for all the space that was consumed + by the btree builder. The new EFDs must point to the EFIs attached to + the reservation to prevent log recovery from freeing the new blocks. + + b. For unclaimed portions of incore reservations, create a regular deferred + extent free work item to be free the unused space later in the + transaction chain. + + c. The EFDs and EFIs logged in steps 2a and 2b must not overrun the + reservation of the committing transaction. + If the btree loading code suspects this might be about to happen, it must + call ``xrep_defer_finish`` to clear out the deferred work and obtain a + fresh transaction. + +3. Clear out the deferred work a second time to finish the commit and clean + the repair transaction. + +The transaction rolling in steps 2c and 3 represent a weakness in the repair +algorithm, because a log flush and a crash before the end of the reap step can +result in space leaking. +Online repair functions minimize the chances of this occurring by using very +large transactions, which each can accommodate many thousands of block freeing +instructions. +Repair moves on to reaping the old blocks, which will be presented in a +subsequent :ref:`section<reaping>` after a few case studies of bulk loading. + +Case Study: Rebuilding the Inode Index +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The high level process to rebuild the inode index btree is: + +1. Walk the reverse mapping records to generate ``struct xfs_inobt_rec`` + records from the inode chunk information and a bitmap of the old inode btree + blocks. + +2. Append the records to an xfarray in inode order. + +3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number + of blocks needed for the inode btree. + If the free space inode btree is enabled, call it again to estimate the + geometry of the finobt. + +4. Allocate the number of blocks computed in the previous step. + +5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and + generate the internal node blocks. + If the free space inode btree is enabled, call it again to load the finobt. + +6. Commit the location of the new btree root block(s) to the AGI. + +7. Reap the old btree blocks using the bitmap created in step 1. + +Details are as follows. + +The inode btree maps inumbers to the ondisk location of the associated +inode records, which means that the inode btrees can be rebuilt from the +reverse mapping information. +Reverse mapping records with an owner of ``XFS_RMAP_OWN_INOBT`` marks the +location of the old inode btree blocks. +Each reverse mapping record with an owner of ``XFS_RMAP_OWN_INODES`` marks the +location of at least one inode cluster buffer. +A cluster is the smallest number of ondisk inodes that can be allocated or +freed in a single transaction; it is never smaller than 1 fs block or 4 inodes. + +For the space represented by each inode cluster, ensure that there are no +records in the free space btrees nor any records in the reference count btree. +If there are, the space metadata inconsistencies are reason enough to abort the +operation. +Otherwise, read each cluster buffer to check that its contents appear to be +ondisk inodes and to decide if the file is allocated +(``xfs_dinode.i_mode != 0``) or free (``xfs_dinode.i_mode == 0``). +Accumulate the results of successive inode cluster buffer reads until there is +enough information to fill a single inode chunk record, which is 64 consecutive +numbers in the inumber keyspace. +If the chunk is sparse, the chunk record may include holes. + +Once the repair function accumulates one chunk's worth of data, it calls +``xfarray_append`` to add the inode btree record to the xfarray. +This xfarray is walked twice during the btree creation step -- once to populate +the inode btree with all inode chunk records, and a second time to populate the +free inode btree with records for chunks that have free non-sparse inodes. +The number of records for the inode btree is the number of xfarray records, +but the record count for the free inode btree has to be computed as inode chunk +records are stored in the xfarray. + +The proposed patchset is the +`AG btree repair +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_ +series. + +Case Study: Rebuilding the Space Reference Counts +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Reverse mapping records are used to rebuild the reference count information. +Reference counts are required for correct operation of copy on write for shared +file data. +Imagine the reverse mapping entries as rectangles representing extents of +physical blocks, and that the rectangles can be laid down to allow them to +overlap each other. +From the diagram below, it is apparent that a reference count record must start +or end wherever the height of the stack changes. +In other words, the record emission stimulus is level-triggered:: + + █ ███ + ██ █████ ████ ███ ██████ + ██ ████ ███████████ ████ █████████ + ████████████████████████████████ ███████████ + ^ ^ ^^ ^^ ^ ^^ ^^^ ^^^^ ^ ^^ ^ ^ ^ + 2 1 23 21 3 43 234 2123 1 01 2 3 0 + +The ondisk reference count btree does not store the refcount == 0 cases because +the free space btree already records which blocks are free. +Extents being used to stage copy-on-write operations should be the only records +with refcount == 1. +Single-owner file blocks aren't recorded in either the free space or the +reference count btrees. + +The high level process to rebuild the reference count btree is: + +1. Walk the reverse mapping records to generate ``struct xfs_refcount_irec`` + records for any space having more than one reverse mapping and add them to + the xfarray. + Any records owned by ``XFS_RMAP_OWN_COW`` are also added to the xfarray + because these are extents allocated to stage a copy on write operation and + are tracked in the refcount btree. + + Use any records owned by ``XFS_RMAP_OWN_REFC`` to create a bitmap of old + refcount btree blocks. + +2. Sort the records in physical extent order, putting the CoW staging extents + at the end of the xfarray. + This matches the sorting order of records in the refcount btree. + +3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number + of blocks needed for the new tree. + +4. Allocate the number of blocks computed in the previous step. + +5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and + generate the internal node blocks. + +6. Commit the location of new btree root block to the AGF. + +7. Reap the old btree blocks using the bitmap created in step 1. + +Details are as follows; the same algorithm is used by ``xfs_repair`` to +generate refcount information from reverse mapping records. + +- Until the reverse mapping btree runs out of records: + + - Retrieve the next record from the btree and put it in a bag. + + - Collect all records with the same starting block from the btree and put + them in the bag. + + - While the bag isn't empty: + + - Among the mappings in the bag, compute the lowest block number where the + reference count changes. + This position will be either the starting block number of the next + unprocessed reverse mapping or the next block after the shortest mapping + in the bag. + + - Remove all mappings from the bag that end at this position. + + - Collect all reverse mappings that start at this position from the btree + and put them in the bag. + + - If the size of the bag changed and is greater than one, create a new + refcount record associating the block number range that we just walked to + the size of the bag. + +The bag-like structure in this case is a type 2 xfarray as discussed in the +:ref:`xfarray access patterns<xfarray_access_patterns>` section. +Reverse mappings are added to the bag using ``xfarray_store_anywhere`` and +removed via ``xfarray_unset``. +Bag members are examined through ``xfarray_iter`` loops. + +The proposed patchset is the +`AG btree repair +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_ +series. + +Case Study: Rebuilding File Fork Mapping Indices +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The high level process to rebuild a data/attr fork mapping btree is: + +1. Walk the reverse mapping records to generate ``struct xfs_bmbt_rec`` + records from the reverse mapping records for that inode and fork. + Append these records to an xfarray. + Compute the bitmap of the old bmap btree blocks from the ``BMBT_BLOCK`` + records. + +2. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number + of blocks needed for the new tree. + +3. Sort the records in file offset order. + +4. If the extent records would fit in the inode fork immediate area, commit the + records to that immediate area and skip to step 8. + +5. Allocate the number of blocks computed in the previous step. + +6. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and + generate the internal node blocks. + +7. Commit the new btree root block to the inode fork immediate area. + +8. Reap the old btree blocks using the bitmap created in step 1. + +There are some complications here: +First, it's possible to move the fork offset to adjust the sizes of the +immediate areas if the data and attr forks are not both in BMBT format. +Second, if there are sufficiently few fork mappings, it may be possible to use +EXTENTS format instead of BMBT, which may require a conversion. +Third, the incore extent map must be reloaded carefully to avoid disturbing +any delayed allocation extents. + +The proposed patchset is the +`file mapping repair +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-file-mappings>`_ +series. + +.. _reaping: + +Reaping Old Metadata Blocks +--------------------------- + +Whenever online fsck builds a new data structure to replace one that is +suspect, there is a question of how to find and dispose of the blocks that +belonged to the old structure. +The laziest method of course is not to deal with them at all, but this slowly +leads to service degradations as space leaks out of the filesystem. +Hopefully, someone will schedule a rebuild of the free space information to +plug all those leaks. +Offline repair rebuilds all space metadata after recording the usage of +the files and directories that it decides not to clear, hence it can build new +structures in the discovered free space and avoid the question of reaping. + +As part of a repair, online fsck relies heavily on the reverse mapping records +to find space that is owned by the corresponding rmap owner yet truly free. +Cross referencing rmap records with other rmap records is necessary because +there may be other data structures that also think they own some of those +blocks (e.g. crosslinked trees). +Permitting the block allocator to hand them out again will not push the system +towards consistency. + +For space metadata, the process of finding extents to dispose of generally +follows this format: + +1. Create a bitmap of space used by data structures that must be preserved. + The space reservations used to create the new metadata can be used here if + the same rmap owner code is used to denote all of the objects being rebuilt. + +2. Survey the reverse mapping data to create a bitmap of space owned by the + same ``XFS_RMAP_OWN_*`` number for the metadata that is being preserved. + +3. Use the bitmap disunion operator to subtract (1) from (2). + The remaining set bits represent candidate extents that could be freed. + The process moves on to step 4 below. + +Repairs for file-based metadata such as extended attributes, directories, +symbolic links, quota files and realtime bitmaps are performed by building a +new structure attached to a temporary file and swapping the forks. +Afterward, the mappings in the old file fork are the candidate blocks for +disposal. + +The process for disposing of old extents is as follows: + +4. For each candidate extent, count the number of reverse mapping records for + the first block in that extent that do not have the same rmap owner for the + data structure being repaired. + + - If zero, the block has a single owner and can be freed. + + - If not, the block is part of a crosslinked structure and must not be + freed. + +5. Starting with the next block in the extent, figure out how many more blocks + have the same zero/nonzero other owner status as that first block. + +6. If the region is crosslinked, delete the reverse mapping entry for the + structure being repaired and move on to the next region. + +7. If the region is to be freed, mark any corresponding buffers in the buffer + cache as stale to prevent log writeback. + +8. Free the region and move on. + +However, there is one complication to this procedure. +Transactions are of finite size, so the reaping process must be careful to roll +the transactions to avoid overruns. +Overruns come from two sources: + +a. EFIs logged on behalf of space that is no longer occupied + +b. Log items for buffer invalidations + +This is also a window in which a crash during the reaping process can leak +blocks. +As stated earlier, online repair functions use very large transactions to +minimize the chances of this occurring. + +The proposed patchset is the +`preparation for bulk loading btrees +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loading>`_ +series. + +Case Study: Reaping After a Regular Btree Repair +```````````````````````````````````````````````` + +Old reference count and inode btrees are the easiest to reap because they have +rmap records with special owner codes: ``XFS_RMAP_OWN_REFC`` for the refcount +btree, and ``XFS_RMAP_OWN_INOBT`` for the inode and free inode btrees. +Creating a list of extents to reap the old btree blocks is quite simple, +conceptually: + +1. Lock the relevant AGI/AGF header buffers to prevent allocation and frees. + +2. For each reverse mapping record with an rmap owner corresponding to the + metadata structure being rebuilt, set the corresponding range in a bitmap. + +3. Walk the current data structures that have the same rmap owner. + For each block visited, clear that range in the above bitmap. + +4. Each set bit in the bitmap represents a block that could be a block from the + old data structures and hence is a candidate for reaping. + In other words, ``(rmap_records_owned_by & ~blocks_reachable_by_walk)`` + are the blocks that might be freeable. + +If it is possible to maintain the AGF lock throughout the repair (which is the +common case), then step 2 can be performed at the same time as the reverse +mapping record walk that creates the records for the new btree. + +Case Study: Rebuilding the Free Space Indices +````````````````````````````````````````````` + +The high level process to rebuild the free space indices is: + +1. Walk the reverse mapping records to generate ``struct xfs_alloc_rec_incore`` + records from the gaps in the reverse mapping btree. + +2. Append the records to an xfarray. + +3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number + of blocks needed for each new tree. + +4. Allocate the number of blocks computed in the previous step from the free + space information collected. + +5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and + generate the internal node blocks for the free space by length index. + Call it again for the free space by block number index. + +6. Commit the locations of the new btree root blocks to the AGF. + +7. Reap the old btree blocks by looking for space that is not recorded by the + reverse mapping btree, the new free space btrees, or the AGFL. + +Repairing the free space btrees has three key complications over a regular +btree repair: + +First, free space is not explicitly tracked in the reverse mapping records. +Hence, the new free space records must be inferred from gaps in the physical +space component of the keyspace of the reverse mapping btree. + +Second, free space repairs cannot use the common btree reservation code because +new blocks are reserved out of the free space btrees. +This is impossible when repairing the free space btrees themselves. +However, repair holds the AGF buffer lock for the duration of the free space +index reconstruction, so it can use the collected free space information to +supply the blocks for the new free space btrees. +It is not necessary to back each reserved extent with an EFI because the new +free space btrees are constructed in what the ondisk filesystem thinks is +unowned space. +However, if reserving blocks for the new btrees from the collected free space +information changes the number of free space records, repair must re-estimate +the new free space btree geometry with the new record count until the +reservation is sufficient. +As part of committing the new btrees, repair must ensure that reverse mappings +are created for the reserved blocks and that unused reserved blocks are +inserted into the free space btrees. +Deferrred rmap and freeing operations are used to ensure that this transition +is atomic, similar to the other btree repair functions. + +Third, finding the blocks to reap after the repair is not overly +straightforward. +Blocks for the free space btrees and the reverse mapping btrees are supplied by +the AGFL. +Blocks put onto the AGFL have reverse mapping records with the owner +``XFS_RMAP_OWN_AG``. +This ownership is retained when blocks move from the AGFL into the free space +btrees or the reverse mapping btrees. +When repair walks reverse mapping records to synthesize free space records, it +creates a bitmap (``ag_owner_bitmap``) of all the space claimed by +``XFS_RMAP_OWN_AG`` records. +The repair context maintains a second bitmap corresponding to the rmap btree +blocks and the AGFL blocks (``rmap_agfl_bitmap``). +When the walk is complete, the bitmap disunion operation ``(ag_owner_bitmap & +~rmap_agfl_bitmap)`` computes the extents that are used by the old free space +btrees. +These blocks can then be reaped using the methods outlined above. + +The proposed patchset is the +`AG btree repair +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_ +series. + +.. _rmap_reap: + +Case Study: Reaping After Repairing Reverse Mapping Btrees +`````````````````````````````````````````````````````````` + +Old reverse mapping btrees are less difficult to reap after a repair. +As mentioned in the previous section, blocks on the AGFL, the two free space +btree blocks, and the reverse mapping btree blocks all have reverse mapping +records with ``XFS_RMAP_OWN_AG`` as the owner. +The full process of gathering reverse mapping records and building a new btree +are described in the case study of +:ref:`live rebuilds of rmap data <rmap_repair>`, but a crucial point from that +discussion is that the new rmap btree will not contain any records for the old +rmap btree, nor will the old btree blocks be tracked in the free space btrees. +The list of candidate reaping blocks is computed by setting the bits +corresponding to the gaps in the new rmap btree records, and then clearing the +bits corresponding to extents in the free space btrees and the current AGFL +blocks. +The result ``(new_rmapbt_gaps & ~(agfl | bnobt_records))`` are reaped using the +methods outlined above. + +The rest of the process of rebuildng the reverse mapping btree is discussed +in a separate :ref:`case study<rmap_repair>`. + +The proposed patchset is the +`AG btree repair +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_ +series. + +Case Study: Rebuilding the AGFL +``````````````````````````````` + +The allocation group free block list (AGFL) is repaired as follows: + +1. Create a bitmap for all the space that the reverse mapping data claims is + owned by ``XFS_RMAP_OWN_AG``. + +2. Subtract the space used by the two free space btrees and the rmap btree. + +3. Subtract any space that the reverse mapping data claims is owned by any + other owner, to avoid re-adding crosslinked blocks to the AGFL. + +4. Once the AGFL is full, reap any blocks leftover. + +5. The next operation to fix the freelist will right-size the list. + +See `fs/xfs/scrub/agheader_repair.c <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/xfs/scrub/agheader_repair.c>`_ for more details. + +Inode Record Repairs +-------------------- + +Inode records must be handled carefully, because they have both ondisk records +("dinodes") and an in-memory ("cached") representation. +There is a very high potential for cache coherency issues if online fsck is not +careful to access the ondisk metadata *only* when the ondisk metadata is so +badly damaged that the filesystem cannot load the in-memory representation. +When online fsck wants to open a damaged file for scrubbing, it must use +specialized resource acquisition functions that return either the in-memory +representation *or* a lock on whichever object is necessary to prevent any +update to the ondisk location. + +The only repairs that should be made to the ondisk inode buffers are whatever +is necessary to get the in-core structure loaded. +This means fixing whatever is caught by the inode cluster buffer and inode fork +verifiers, and retrying the ``iget`` operation. +If the second ``iget`` fails, the repair has failed. + +Once the in-memory representation is loaded, repair can lock the inode and can +subject it to comprehensive checks, repairs, and optimizations. +Most inode attributes are easy to check and constrain, or are user-controlled +arbitrary bit patterns; these are both easy to fix. +Dealing with the data and attr fork extent counts and the file block counts is +more complicated, because computing the correct value requires traversing the +forks, or if that fails, leaving the fields invalid and waiting for the fork +fsck functions to run. + +The proposed patchset is the +`inode +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-inodes>`_ +repair series. + +Quota Record Repairs +-------------------- + +Similar to inodes, quota records ("dquots") also have both ondisk records and +an in-memory representation, and hence are subject to the same cache coherency +issues. +Somewhat confusingly, both are known as dquots in the XFS codebase. + +The only repairs that should be made to the ondisk quota record buffers are +whatever is necessary to get the in-core structure loaded. +Once the in-memory representation is loaded, the only attributes needing +checking are obviously bad limits and timer values. + +Quota usage counters are checked, repaired, and discussed separately in the +section about :ref:`live quotacheck <quotacheck>`. + +The proposed patchset is the +`quota +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quota>`_ +repair series. + +.. _fscounters: + +Freezing to Fix Summary Counters +-------------------------------- + +Filesystem summary counters track availability of filesystem resources such +as free blocks, free inodes, and allocated inodes. +This information could be compiled by walking the free space and inode indexes, +but this is a slow process, so XFS maintains a copy in the ondisk superblock +that should reflect the ondisk metadata, at least when the filesystem has been +unmounted cleanly. +For performance reasons, XFS also maintains incore copies of those counters, +which are key to enabling resource reservations for active transactions. +Writer threads reserve the worst-case quantities of resources from the +incore counter and give back whatever they don't use at commit time. +It is therefore only necessary to serialize on the superblock when the +superblock is being committed to disk. + +The lazy superblock counter feature introduced in XFS v5 took this even further +by training log recovery to recompute the summary counters from the AG headers, +which eliminated the need for most transactions even to touch the superblock. +The only time XFS commits the summary counters is at filesystem unmount. +To reduce contention even further, the incore counter is implemented as a +percpu counter, which means that each CPU is allocated a batch of blocks from a +global incore counter and can satisfy small allocations from the local batch. + +The high-performance nature of the summary counters makes it difficult for +online fsck to check them, since there is no way to quiesce a percpu counter +while the system is running. +Although online fsck can read the filesystem metadata to compute the correct +values of the summary counters, there's no way to hold the value of a percpu +counter stable, so it's quite possible that the counter will be out of date by +the time the walk is complete. +Earlier versions of online scrub would return to userspace with an incomplete +scan flag, but this is not a satisfying outcome for a system administrator. +For repairs, the in-memory counters must be stabilized while walking the +filesystem metadata to get an accurate reading and install it in the percpu +counter. + +To satisfy this requirement, online fsck must prevent other programs in the +system from initiating new writes to the filesystem, it must disable background +garbage collection threads, and it must wait for existing writer programs to +exit the kernel. +Once that has been established, scrub can walk the AG free space indexes, the +inode btrees, and the realtime bitmap to compute the correct value of all +four summary counters. +This is very similar to a filesystem freeze, though not all of the pieces are +necessary: + +- The final freeze state is set one higher than ``SB_FREEZE_COMPLETE`` to + prevent other threads from thawing the filesystem, or other scrub threads + from initiating another fscounters freeze. + +- It does not quiesce the log. + +With this code in place, it is now possible to pause the filesystem for just +long enough to check and correct the summary counters. + ++--------------------------------------------------------------------------+ +| **Historical Sidebar**: | ++--------------------------------------------------------------------------+ +| The initial implementation used the actual VFS filesystem freeze | +| mechanism to quiesce filesystem activity. | +| With the filesystem frozen, it is possible to resolve the counter values | +| with exact precision, but there are many problems with calling the VFS | +| methods directly: | +| | +| - Other programs can unfreeze the filesystem without our knowledge. | +| This leads to incorrect scan results and incorrect repairs. | +| | +| - Adding an extra lock to prevent others from thawing the filesystem | +| required the addition of a ``->freeze_super`` function to wrap | +| ``freeze_fs()``. | +| This in turn caused other subtle problems because it turns out that | +| the VFS ``freeze_super`` and ``thaw_super`` functions can drop the | +| last reference to the VFS superblock, and any subsequent access | +| becomes a UAF bug! | +| This can happen if the filesystem is unmounted while the underlying | +| block device has frozen the filesystem. | +| This problem could be solved by grabbing extra references to the | +| superblock, but it felt suboptimal given the other inadequacies of | +| this approach. | +| | +| - The log need not be quiesced to check the summary counters, but a VFS | +| freeze initiates one anyway. | +| This adds unnecessary runtime to live fscounter fsck operations. | +| | +| - Quiescing the log means that XFS flushes the (possibly incorrect) | +| counters to disk as part of cleaning the log. | +| | +| - A bug in the VFS meant that freeze could complete even when | +| sync_filesystem fails to flush the filesystem and returns an error. | +| This bug was fixed in Linux 5.17. | ++--------------------------------------------------------------------------+ + +The proposed patchset is the +`summary counter cleanup +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-fscounters>`_ +series. + +Full Filesystem Scans +--------------------- + +Certain types of metadata can only be checked by walking every file in the +entire filesystem to record observations and comparing the observations against +what's recorded on disk. +Like every other type of online repair, repairs are made by writing those +observations to disk in a replacement structure and committing it atomically. +However, it is not practical to shut down the entire filesystem to examine +hundreds of billions of files because the downtime would be excessive. +Therefore, online fsck must build the infrastructure to manage a live scan of +all the files in the filesystem. +There are two questions that need to be solved to perform a live walk: + +- How does scrub manage the scan while it is collecting data? + +- How does the scan keep abreast of changes being made to the system by other + threads? + +.. _iscan: + +Coordinated Inode Scans +``````````````````````` + +In the original Unix filesystems of the 1970s, each directory entry contained +an index number (*inumber*) which was used as an index into on ondisk array +(*itable*) of fixed-size records (*inodes*) describing a file's attributes and +its data block mapping. +This system is described by J. Lions, `"inode (5659)" +<http://www.lemis.com/grog/Documentation/Lions/>`_ in *Lions' Commentary on +UNIX, 6th Edition*, (Dept. of Computer Science, the University of New South +Wales, November 1977), pp. 18-2; and later by D. Ritchie and K. Thompson, +`"Implementation of the File System" +<https://archive.org/details/bstj57-6-1905/page/n8/mode/1up>`_, from *The UNIX +Time-Sharing System*, (The Bell System Technical Journal, July 1978), pp. +1913-4. + +XFS retains most of this design, except now inumbers are search keys over all +the space in the data section filesystem. +They form a continuous keyspace that can be expressed as a 64-bit integer, +though the inodes themselves are sparsely distributed within the keyspace. +Scans proceed in a linear fashion across the inumber keyspace, starting from +``0x0`` and ending at ``0xFFFFFFFFFFFFFFFF``. +Naturally, a scan through a keyspace requires a scan cursor object to track the +scan progress. +Because this keyspace is sparse, this cursor contains two parts. +The first part of this scan cursor object tracks the inode that will be +examined next; call this the examination cursor. +Somewhat less obviously, the scan cursor object must also track which parts of +the keyspace have already been visited, which is critical for deciding if a +concurrent filesystem update needs to be incorporated into the scan data. +Call this the visited inode cursor. + +Advancing the scan cursor is a multi-step process encapsulated in +``xchk_iscan_iter``: + +1. Lock the AGI buffer of the AG containing the inode pointed to by the visited + inode cursor. + This guarantee that inodes in this AG cannot be allocated or freed while + advancing the cursor. + +2. Use the per-AG inode btree to look up the next inumber after the one that + was just visited, since it may not be keyspace adjacent. + +3. If there are no more inodes left in this AG: + + a. Move the examination cursor to the point of the inumber keyspace that + corresponds to the start of the next AG. + + b. Adjust the visited inode cursor to indicate that it has "visited" the + last possible inode in the current AG's inode keyspace. + XFS inumbers are segmented, so the cursor needs to be marked as having + visited the entire keyspace up to just before the start of the next AG's + inode keyspace. + + c. Unlock the AGI and return to step 1 if there are unexamined AGs in the + filesystem. + + d. If there are no more AGs to examine, set both cursors to the end of the + inumber keyspace. + The scan is now complete. + +4. Otherwise, there is at least one more inode to scan in this AG: + + a. Move the examination cursor ahead to the next inode marked as allocated + by the inode btree. + + b. Adjust the visited inode cursor to point to the inode just prior to where + the examination cursor is now. + Because the scanner holds the AGI buffer lock, no inodes could have been + created in the part of the inode keyspace that the visited inode cursor + just advanced. + +5. Get the incore inode for the inumber of the examination cursor. + By maintaining the AGI buffer lock until this point, the scanner knows that + it was safe to advance the examination cursor across the entire keyspace, + and that it has stabilized this next inode so that it cannot disappear from + the filesystem until the scan releases the incore inode. + +6. Drop the AGI lock and return the incore inode to the caller. + +Online fsck functions scan all files in the filesystem as follows: + +1. Start a scan by calling ``xchk_iscan_start``. + +2. Advance the scan cursor (``xchk_iscan_iter``) to get the next inode. + If one is provided: + + a. Lock the inode to prevent updates during the scan. + + b. Scan the inode. + + c. While still holding the inode lock, adjust the visited inode cursor + (``xchk_iscan_mark_visited``) to point to this inode. + + d. Unlock and release the inode. + +8. Call ``xchk_iscan_teardown`` to complete the scan. + +There are subtleties with the inode cache that complicate grabbing the incore +inode for the caller. +Obviously, it is an absolute requirement that the inode metadata be consistent +enough to load it into the inode cache. +Second, if the incore inode is stuck in some intermediate state, the scan +coordinator must release the AGI and push the main filesystem to get the inode +back into a loadable state. + +The proposed patches are the +`inode scanner +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_ +series. +The first user of the new functionality is the +`online quotacheck +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_ +series. + +Inode Management +```````````````` + +In regular filesystem code, references to allocated XFS incore inodes are +always obtained (``xfs_iget``) outside of transaction context because the +creation of the incore context for an existing file does not require metadata +updates. +However, it is important to note that references to incore inodes obtained as +part of file creation must be performed in transaction context because the +filesystem must ensure the atomicity of the ondisk inode btree index updates +and the initialization of the actual ondisk inode. + +References to incore inodes are always released (``xfs_irele``) outside of +transaction context because there are a handful of activities that might +require ondisk updates: + +- The VFS may decide to kick off writeback as part of a ``DONTCACHE`` inode + release. + +- Speculative preallocations need to be unreserved. + +- An unlinked file may have lost its last reference, in which case the entire + file must be inactivated, which involves releasing all of its resources in + the ondisk metadata and freeing the inode. + +These activities are collectively called inode inactivation. +Inactivation has two parts -- the VFS part, which initiates writeback on all +dirty file pages, and the XFS part, which cleans up XFS-specific information +and frees the inode if it was unlinked. +If the inode is unlinked (or unconnected after a file handle operation), the +kernel drops the inode into the inactivation machinery immediately. + +During normal operation, resource acquisition for an update follows this order +to avoid deadlocks: + +1. Inode reference (``iget``). + +2. Filesystem freeze protection, if repairing (``mnt_want_write_file``). + +3. Inode ``IOLOCK`` (VFS ``i_rwsem``) lock to control file IO. + +4. Inode ``MMAPLOCK`` (page cache ``invalidate_lock``) lock for operations that + can update page cache mappings. + +5. Log feature enablement. + +6. Transaction log space grant. + +7. Space on the data and realtime devices for the transaction. + +8. Incore dquot references, if a file is being repaired. + Note that they are not locked, merely acquired. + +9. Inode ``ILOCK`` for file metadata updates. + +10. AG header buffer locks / Realtime metadata inode ILOCK. + +11. Realtime metadata buffer locks, if applicable. + +12. Extent mapping btree blocks, if applicable. + +Resources are often released in the reverse order, though this is not required. +However, online fsck differs from regular XFS operations because it may examine +an object that normally is acquired in a later stage of the locking order, and +then decide to cross-reference the object with an object that is acquired +earlier in the order. +The next few sections detail the specific ways in which online fsck takes care +to avoid deadlocks. + +iget and irele During a Scrub +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +An inode scan performed on behalf of a scrub operation runs in transaction +context, and possibly with resources already locked and bound to it. +This isn't much of a problem for ``iget`` since it can operate in the context +of an existing transaction, as long as all of the bound resources are acquired +before the inode reference in the regular filesystem. + +When the VFS ``iput`` function is given a linked inode with no other +references, it normally puts the inode on an LRU list in the hope that it can +save time if another process re-opens the file before the system runs out +of memory and frees it. +Filesystem callers can short-circuit the LRU process by setting a ``DONTCACHE`` +flag on the inode to cause the kernel to try to drop the inode into the +inactivation machinery immediately. + +In the past, inactivation was always done from the process that dropped the +inode, which was a problem for scrub because scrub may already hold a +transaction, and XFS does not support nesting transactions. +On the other hand, if there is no scrub transaction, it is desirable to drop +otherwise unused inodes immediately to avoid polluting caches. +To capture these nuances, the online fsck code has a separate ``xchk_irele`` +function to set or clear the ``DONTCACHE`` flag to get the required release +behavior. + +Proposed patchsets include fixing +`scrub iget usage +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iget-fixes>`_ and +`dir iget usage +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-dir-iget-fixes>`_. + +.. _ilocking: + +Locking Inodes +^^^^^^^^^^^^^^ + +In regular filesystem code, the VFS and XFS will acquire multiple IOLOCK locks +in a well-known order: parent → child when updating the directory tree, and +in numerical order of the addresses of their ``struct inode`` object otherwise. +For regular files, the MMAPLOCK can be acquired after the IOLOCK to stop page +faults. +If two MMAPLOCKs must be acquired, they are acquired in numerical order of +the addresses of their ``struct address_space`` objects. +Due to the structure of existing filesystem code, IOLOCKs and MMAPLOCKs must be +acquired before transactions are allocated. +If two ILOCKs must be acquired, they are acquired in inumber order. + +Inode lock acquisition must be done carefully during a coordinated inode scan. +Online fsck cannot abide these conventions, because for a directory tree +scanner, the scrub process holds the IOLOCK of the file being scanned and it +needs to take the IOLOCK of the file at the other end of the directory link. +If the directory tree is corrupt because it contains a cycle, ``xfs_scrub`` +cannot use the regular inode locking functions and avoid becoming trapped in an +ABBA deadlock. + +Solving both of these problems is straightforward -- any time online fsck +needs to take a second lock of the same class, it uses trylock to avoid an ABBA +deadlock. +If the trylock fails, scrub drops all inode locks and use trylock loops to +(re)acquire all necessary resources. +Trylock loops enable scrub to check for pending fatal signals, which is how +scrub avoids deadlocking the filesystem or becoming an unresponsive process. +However, trylock loops means that online fsck must be prepared to measure the +resource being scrubbed before and after the lock cycle to detect changes and +react accordingly. + +.. _dirparent: + +Case Study: Finding a Directory Parent +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Consider the directory parent pointer repair code as an example. +Online fsck must verify that the dotdot dirent of a directory points up to a +parent directory, and that the parent directory contains exactly one dirent +pointing down to the child directory. +Fully validating this relationship (and repairing it if possible) requires a +walk of every directory on the filesystem while holding the child locked, and +while updates to the directory tree are being made. +The coordinated inode scan provides a way to walk the filesystem without the +possibility of missing an inode. +The child directory is kept locked to prevent updates to the dotdot dirent, but +if the scanner fails to lock a parent, it can drop and relock both the child +and the prospective parent. +If the dotdot entry changes while the directory is unlocked, then a move or +rename operation must have changed the child's parentage, and the scan can +exit early. + +The proposed patchset is the +`directory repair +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_ +series. + +.. _fshooks: + +Filesystem Hooks +````````````````` + +The second piece of support that online fsck functions need during a full +filesystem scan is the ability to stay informed about updates being made by +other threads in the filesystem, since comparisons against the past are useless +in a dynamic environment. +Two pieces of Linux kernel infrastructure enable online fsck to monitor regular +filesystem operations: filesystem hooks and :ref:`static keys<jump_labels>`. + +Filesystem hooks convey information about an ongoing filesystem operation to +a downstream consumer. +In this case, the downstream consumer is always an online fsck function. +Because multiple fsck functions can run in parallel, online fsck uses the Linux +notifier call chain facility to dispatch updates to any number of interested +fsck processes. +Call chains are a dynamic list, which means that they can be configured at +run time. +Because these hooks are private to the XFS module, the information passed along +contains exactly what the checking function needs to update its observations. + +The current implementation of XFS hooks uses SRCU notifier chains to reduce the +impact to highly threaded workloads. +Regular blocking notifier chains use a rwsem and seem to have a much lower +overhead for single-threaded applications. +However, it may turn out that the combination of blocking chains and static +keys are a more performant combination; more study is needed here. + +The following pieces are necessary to hook a certain point in the filesystem: + +- A ``struct xfs_hooks`` object must be embedded in a convenient place such as + a well-known incore filesystem object. + +- Each hook must define an action code and a structure containing more context + about the action. + +- Hook providers should provide appropriate wrapper functions and structs + around the ``xfs_hooks`` and ``xfs_hook`` objects to take advantage of type + checking to ensure correct usage. + +- A callsite in the regular filesystem code must be chosen to call + ``xfs_hooks_call`` with the action code and data structure. + This place should be adjacent to (and not earlier than) the place where + the filesystem update is committed to the transaction. + In general, when the filesystem calls a hook chain, it should be able to + handle sleeping and should not be vulnerable to memory reclaim or locking + recursion. + However, the exact requirements are very dependent on the context of the hook + caller and the callee. + +- The online fsck function should define a structure to hold scan data, a lock + to coordinate access to the scan data, and a ``struct xfs_hook`` object. + The scanner function and the regular filesystem code must acquire resources + in the same order; see the next section for details. + +- The online fsck code must contain a C function to catch the hook action code + and data structure. + If the object being updated has already been visited by the scan, then the + hook information must be applied to the scan data. + +- Prior to unlocking inodes to start the scan, online fsck must call + ``xfs_hooks_setup`` to initialize the ``struct xfs_hook``, and + ``xfs_hooks_add`` to enable the hook. + +- Online fsck must call ``xfs_hooks_del`` to disable the hook once the scan is + complete. + +The number of hooks should be kept to a minimum to reduce complexity. +Static keys are used to reduce the overhead of filesystem hooks to nearly +zero when online fsck is not running. + +.. _liveupdate: + +Live Updates During a Scan +`````````````````````````` + +The code paths of the online fsck scanning code and the :ref:`hooked<fshooks>` +filesystem code look like this:: + + other program + ↓ + inode lock ←────────────────────┐ + ↓ │ + AG header lock │ + ↓ │ + filesystem function │ + ↓ │ + notifier call chain │ same + ↓ ├─── inode + scrub hook function │ lock + ↓ │ + scan data mutex ←──┐ same │ + ↓ ├─── scan │ + update scan data │ lock │ + ↑ │ │ + scan data mutex ←──┘ │ + ↑ │ + inode lock ←────────────────────┘ + ↑ + scrub function + ↑ + inode scanner + ↑ + xfs_scrub + +These rules must be followed to ensure correct interactions between the +checking code and the code making an update to the filesystem: + +- Prior to invoking the notifier call chain, the filesystem function being + hooked must acquire the same lock that the scrub scanning function acquires + to scan the inode. + +- The scanning function and the scrub hook function must coordinate access to + the scan data by acquiring a lock on the scan data. + +- Scrub hook function must not add the live update information to the scan + observations unless the inode being updated has already been scanned. + The scan coordinator has a helper predicate (``xchk_iscan_want_live_update``) + for this. + +- Scrub hook functions must not change the caller's state, including the + transaction that it is running. + They must not acquire any resources that might conflict with the filesystem + function being hooked. + +- The hook function can abort the inode scan to avoid breaking the other rules. + +The inode scan APIs are pretty simple: + +- ``xchk_iscan_start`` starts a scan + +- ``xchk_iscan_iter`` grabs a reference to the next inode in the scan or + returns zero if there is nothing left to scan + +- ``xchk_iscan_want_live_update`` to decide if an inode has already been + visited in the scan. + This is critical for hook functions to decide if they need to update the + in-memory scan information. + +- ``xchk_iscan_mark_visited`` to mark an inode as having been visited in the + scan + +- ``xchk_iscan_teardown`` to finish the scan + +This functionality is also a part of the +`inode scanner +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_ +series. + +.. _quotacheck: + +Case Study: Quota Counter Checking +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +It is useful to compare the mount time quotacheck code to the online repair +quotacheck code. +Mount time quotacheck does not have to contend with concurrent operations, so +it does the following: + +1. Make sure the ondisk dquots are in good enough shape that all the incore + dquots will actually load, and zero the resource usage counters in the + ondisk buffer. + +2. Walk every inode in the filesystem. + Add each file's resource usage to the incore dquot. + +3. Walk each incore dquot. + If the incore dquot is not being flushed, add the ondisk buffer backing the + incore dquot to a delayed write (delwri) list. + +4. Write the buffer list to disk. + +Like most online fsck functions, online quotacheck can't write to regular +filesystem objects until the newly collected metadata reflect all filesystem +state. +Therefore, online quotacheck records file resource usage to a shadow dquot +index implemented with a sparse ``xfarray``, and only writes to the real dquots +once the scan is complete. +Handling transactional updates is tricky because quota resource usage updates +are handled in phases to minimize contention on dquots: + +1. The inodes involved are joined and locked to a transaction. + +2. For each dquot attached to the file: + + a. The dquot is locked. + + b. A quota reservation is added to the dquot's resource usage. + The reservation is recorded in the transaction. + + c. The dquot is unlocked. + +3. Changes in actual quota usage are tracked in the transaction. + +4. At transaction commit time, each dquot is examined again: + + a. The dquot is locked again. + + b. Quota usage changes are logged and unused reservation is given back to + the dquot. + + c. The dquot is unlocked. + +For online quotacheck, hooks are placed in steps 2 and 4. +The step 2 hook creates a shadow version of the transaction dquot context +(``dqtrx``) that operates in a similar manner to the regular code. +The step 4 hook commits the shadow ``dqtrx`` changes to the shadow dquots. +Notice that both hooks are called with the inode locked, which is how the +live update coordinates with the inode scanner. + +The quotacheck scan looks like this: + +1. Set up a coordinated inode scan. + +2. For each inode returned by the inode scan iterator: + + a. Grab and lock the inode. + + b. Determine that inode's resource usage (data blocks, inode counts, + realtime blocks) and add that to the shadow dquots for the user, group, + and project ids associated with the inode. + + c. Unlock and release the inode. + +3. For each dquot in the system: + + a. Grab and lock the dquot. + + b. Check the dquot against the shadow dquots created by the scan and updated + by the live hooks. + +Live updates are key to being able to walk every quota record without +needing to hold any locks for a long duration. +If repairs are desired, the real and shadow dquots are locked and their +resource counts are set to the values in the shadow dquot. + +The proposed patchset is the +`online quotacheck +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_ +series. + +.. _nlinks: + +Case Study: File Link Count Checking +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +File link count checking also uses live update hooks. +The coordinated inode scanner is used to visit all directories on the +filesystem, and per-file link count records are stored in a sparse ``xfarray`` +indexed by inumber. +During the scanning phase, each entry in a directory generates observation +data as follows: + +1. If the entry is a dotdot (``'..'``) entry of the root directory, the + directory's parent link count is bumped because the root directory's dotdot + entry is self referential. + +2. If the entry is a dotdot entry of a subdirectory, the parent's backref + count is bumped. + +3. If the entry is neither a dot nor a dotdot entry, the target file's parent + count is bumped. + +4. If the target is a subdirectory, the parent's child link count is bumped. + +A crucial point to understand about how the link count inode scanner interacts +with the live update hooks is that the scan cursor tracks which *parent* +directories have been scanned. +In other words, the live updates ignore any update about ``A → B`` when A has +not been scanned, even if B has been scanned. +Furthermore, a subdirectory A with a dotdot entry pointing back to B is +accounted as a backref counter in the shadow data for A, since child dotdot +entries affect the parent's link count. +Live update hooks are carefully placed in all parts of the filesystem that +create, change, or remove directory entries, since those operations involve +bumplink and droplink. + +For any file, the correct link count is the number of parents plus the number +of child subdirectories. +Non-directories never have children of any kind. +The backref information is used to detect inconsistencies in the number of +links pointing to child subdirectories and the number of dotdot entries +pointing back. + +After the scan completes, the link count of each file can be checked by locking +both the inode and the shadow data, and comparing the link counts. +A second coordinated inode scan cursor is used for comparisons. +Live updates are key to being able to walk every inode without needing to hold +any locks between inodes. +If repairs are desired, the inode's link count is set to the value in the +shadow information. +If no parents are found, the file must be :ref:`reparented <orphanage>` to the +orphanage to prevent the file from being lost forever. + +The proposed patchset is the +`file link count repair +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-nlinks>`_ +series. + +.. _rmap_repair: + +Case Study: Rebuilding Reverse Mapping Records +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Most repair functions follow the same pattern: lock filesystem resources, +walk the surviving ondisk metadata looking for replacement metadata records, +and use an :ref:`in-memory array <xfarray>` to store the gathered observations. +The primary advantage of this approach is the simplicity and modularity of the +repair code -- code and data are entirely contained within the scrub module, +do not require hooks in the main filesystem, and are usually the most efficient +in memory use. +A secondary advantage of this repair approach is atomicity -- once the kernel +decides a structure is corrupt, no other threads can access the metadata until +the kernel finishes repairing and revalidating the metadata. + +For repairs going on within a shard of the filesystem, these advantages +outweigh the delays inherent in locking the shard while repairing parts of the +shard. +Unfortunately, repairs to the reverse mapping btree cannot use the "standard" +btree repair strategy because it must scan every space mapping of every fork of +every file in the filesystem, and the filesystem cannot stop. +Therefore, rmap repair foregoes atomicity between scrub and repair. +It combines a :ref:`coordinated inode scanner <iscan>`, :ref:`live update hooks +<liveupdate>`, and an :ref:`in-memory rmap btree <xfbtree>` to complete the +scan for reverse mapping records. + +1. Set up an xfbtree to stage rmap records. + +2. While holding the locks on the AGI and AGF buffers acquired during the + scrub, generate reverse mappings for all AG metadata: inodes, btrees, CoW + staging extents, and the internal log. + +3. Set up an inode scanner. + +4. Hook into rmap updates for the AG being repaired so that the live scan data + can receive updates to the rmap btree from the rest of the filesystem during + the file scan. + +5. For each space mapping found in either fork of each file scanned, + decide if the mapping matches the AG of interest. + If so: + + a. Create a btree cursor for the in-memory btree. + + b. Use the rmap code to add the record to the in-memory btree. + + c. Use the :ref:`special commit function <xfbtree_commit>` to write the + xfbtree changes to the xfile. + +6. For each live update received via the hook, decide if the owner has already + been scanned. + If so, apply the live update into the scan data: + + a. Create a btree cursor for the in-memory btree. + + b. Replay the operation into the in-memory btree. + + c. Use the :ref:`special commit function <xfbtree_commit>` to write the + xfbtree changes to the xfile. + This is performed with an empty transaction to avoid changing the + caller's state. + +7. When the inode scan finishes, create a new scrub transaction and relock the + two AG headers. + +8. Compute the new btree geometry using the number of rmap records in the + shadow btree, like all other btree rebuilding functions. + +9. Allocate the number of blocks computed in the previous step. + +10. Perform the usual btree bulk loading and commit to install the new rmap + btree. + +11. Reap the old rmap btree blocks as discussed in the case study about how + to :ref:`reap after rmap btree repair <rmap_reap>`. + +12. Free the xfbtree now that it not needed. + +The proposed patchset is the +`rmap repair +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rmap-btree>`_ +series. + +Staging Repairs with Temporary Files on Disk +-------------------------------------------- + +XFS stores a substantial amount of metadata in file forks: directories, +extended attributes, symbolic link targets, free space bitmaps and summary +information for the realtime volume, and quota records. +File forks map 64-bit logical file fork space extents to physical storage space +extents, similar to how a memory management unit maps 64-bit virtual addresses +to physical memory addresses. +Therefore, file-based tree structures (such as directories and extended +attributes) use blocks mapped in the file fork offset address space that point +to other blocks mapped within that same address space, and file-based linear +structures (such as bitmaps and quota records) compute array element offsets in +the file fork offset address space. + +Because file forks can consume as much space as the entire filesystem, repairs +cannot be staged in memory, even when a paging scheme is available. +Therefore, online repair of file-based metadata createas a temporary file in +the XFS filesystem, writes a new structure at the correct offsets into the +temporary file, and atomically swaps the fork mappings (and hence the fork +contents) to commit the repair. +Once the repair is complete, the old fork can be reaped as necessary; if the +system goes down during the reap, the iunlink code will delete the blocks +during log recovery. + +**Note**: All space usage and inode indices in the filesystem *must* be +consistent to use a temporary file safely! +This dependency is the reason why online repair can only use pageable kernel +memory to stage ondisk space usage information. + +Swapping metadata extents with a temporary file requires the owner field of the +block headers to match the file being repaired and not the temporary file. The +directory, extended attribute, and symbolic link functions were all modified to +allow callers to specify owner numbers explicitly. + +There is a downside to the reaping process -- if the system crashes during the +reap phase and the fork extents are crosslinked, the iunlink processing will +fail because freeing space will find the extra reverse mappings and abort. + +Temporary files created for repair are similar to ``O_TMPFILE`` files created +by userspace. +They are not linked into a directory and the entire file will be reaped when +the last reference to the file is lost. +The key differences are that these files must have no access permission outside +the kernel at all, they must be specially marked to prevent them from being +opened by handle, and they must never be linked into the directory tree. + ++--------------------------------------------------------------------------+ +| **Historical Sidebar**: | ++--------------------------------------------------------------------------+ +| In the initial iteration of file metadata repair, the damaged metadata | +| blocks would be scanned for salvageable data; the extents in the file | +| fork would be reaped; and then a new structure would be built in its | +| place. | +| This strategy did not survive the introduction of the atomic repair | +| requirement expressed earlier in this document. | +| | +| The second iteration explored building a second structure at a high | +| offset in the fork from the salvage data, reaping the old extents, and | +| using a ``COLLAPSE_RANGE`` operation to slide the new extents into | +| place. | +| | +| This had many drawbacks: | +| | +| - Array structures are linearly addressed, and the regular filesystem | +| codebase does not have the concept of a linear offset that could be | +| applied to the record offset computation to build an alternate copy. | +| | +| - Extended attributes are allowed to use the entire attr fork offset | +| address space. | +| | +| - Even if repair could build an alternate copy of a data structure in a | +| different part of the fork address space, the atomic repair commit | +| requirement means that online repair would have to be able to perform | +| a log assisted ``COLLAPSE_RANGE`` operation to ensure that the old | +| structure was completely replaced. | +| | +| - A crash after construction of the secondary tree but before the range | +| collapse would leave unreachable blocks in the file fork. | +| This would likely confuse things further. | +| | +| - Reaping blocks after a repair is not a simple operation, and | +| initiating a reap operation from a restarted range collapse operation | +| during log recovery is daunting. | +| | +| - Directory entry blocks and quota records record the file fork offset | +| in the header area of each block. | +| An atomic range collapse operation would have to rewrite this part of | +| each block header. | +| Rewriting a single field in block headers is not a huge problem, but | +| it's something to be aware of. | +| | +| - Each block in a directory or extended attributes btree index contains | +| sibling and child block pointers. | +| Were the atomic commit to use a range collapse operation, each block | +| would have to be rewritten very carefully to preserve the graph | +| structure. | +| Doing this as part of a range collapse means rewriting a large number | +| of blocks repeatedly, which is not conducive to quick repairs. | +| | +| This lead to the introduction of temporary file staging. | ++--------------------------------------------------------------------------+ + +Using a Temporary File +`````````````````````` + +Online repair code should use the ``xrep_tempfile_create`` function to create a +temporary file inside the filesystem. +This allocates an inode, marks the in-core inode private, and attaches it to +the scrub context. +These files are hidden from userspace, may not be added to the directory tree, +and must be kept private. + +Temporary files only use two inode locks: the IOLOCK and the ILOCK. +The MMAPLOCK is not needed here, because there must not be page faults from +userspace for data fork blocks. +The usage patterns of these two locks are the same as for any other XFS file -- +access to file data are controlled via the IOLOCK, and access to file metadata +are controlled via the ILOCK. +Locking helpers are provided so that the temporary file and its lock state can +be cleaned up by the scrub context. +To comply with the nested locking strategy laid out in the :ref:`inode +locking<ilocking>` section, it is recommended that scrub functions use the +xrep_tempfile_ilock*_nowait lock helpers. + +Data can be written to a temporary file by two means: + +1. ``xrep_tempfile_copyin`` can be used to set the contents of a regular + temporary file from an xfile. + +2. The regular directory, symbolic link, and extended attribute functions can + be used to write to the temporary file. + +Once a good copy of a data file has been constructed in a temporary file, it +must be conveyed to the file being repaired, which is the topic of the next +section. + +The proposed patches are in the +`repair temporary files +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-tempfiles>`_ +series. + +Atomic Extent Swapping +---------------------- + +Once repair builds a temporary file with a new data structure written into +it, it must commit the new changes into the existing file. +It is not possible to swap the inumbers of two files, so instead the new +metadata must replace the old. +This suggests the need for the ability to swap extents, but the existing extent +swapping code used by the file defragmenting tool ``xfs_fsr`` is not sufficient +for online repair because: + +a. When the reverse-mapping btree is enabled, the swap code must keep the + reverse mapping information up to date with every exchange of mappings. + Therefore, it can only exchange one mapping per transaction, and each + transaction is independent. + +b. Reverse-mapping is critical for the operation of online fsck, so the old + defragmentation code (which swapped entire extent forks in a single + operation) is not useful here. + +c. Defragmentation is assumed to occur between two files with identical + contents. + For this use case, an incomplete exchange will not result in a user-visible + change in file contents, even if the operation is interrupted. + +d. Online repair needs to swap the contents of two files that are by definition + *not* identical. + For directory and xattr repairs, the user-visible contents might be the + same, but the contents of individual blocks may be very different. + +e. Old blocks in the file may be cross-linked with another structure and must + not reappear if the system goes down mid-repair. + +These problems are overcome by creating a new deferred operation and a new type +of log intent item to track the progress of an operation to exchange two file +ranges. +The new deferred operation type chains together the same transactions used by +the reverse-mapping extent swap code. +The new log item records the progress of the exchange to ensure that once an +exchange begins, it will always run to completion, even there are +interruptions. +The new ``XFS_SB_FEAT_INCOMPAT_LOG_ATOMIC_SWAP`` log-incompatible feature flag +in the superblock protects these new log item records from being replayed on +old kernels. + +The proposed patchset is the +`atomic extent swap +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-updates>`_ +series. + ++--------------------------------------------------------------------------+ +| **Sidebar: Using Log-Incompatible Feature Flags** | ++--------------------------------------------------------------------------+ +| Starting with XFS v5, the superblock contains a | +| ``sb_features_log_incompat`` field to indicate that the log contains | +| records that might not readable by all kernels that could mount this | +| filesystem. | +| In short, log incompat features protect the log contents against kernels | +| that will not understand the contents. | +| Unlike the other superblock feature bits, log incompat bits are | +| ephemeral because an empty (clean) log does not need protection. | +| The log cleans itself after its contents have been committed into the | +| filesystem, either as part of an unmount or because the system is | +| otherwise idle. | +| Because upper level code can be working on a transaction at the same | +| time that the log cleans itself, it is necessary for upper level code to | +| communicate to the log when it is going to use a log incompatible | +| feature. | +| | +| The log coordinates access to incompatible features through the use of | +| one ``struct rw_semaphore`` for each feature. | +| The log cleaning code tries to take this rwsem in exclusive mode to | +| clear the bit; if the lock attempt fails, the feature bit remains set. | +| Filesystem code signals its intention to use a log incompat feature in a | +| transaction by calling ``xlog_use_incompat_feat``, which takes the rwsem | +| in shared mode. | +| The code supporting a log incompat feature should create wrapper | +| functions to obtain the log feature and call | +| ``xfs_add_incompat_log_feature`` to set the feature bits in the primary | +| superblock. | +| The superblock update is performed transactionally, so the wrapper to | +| obtain log assistance must be called just prior to the creation of the | +| transaction that uses the functionality. | +| For a file operation, this step must happen after taking the IOLOCK | +| and the MMAPLOCK, but before allocating the transaction. | +| When the transaction is complete, the ``xlog_drop_incompat_feat`` | +| function is called to release the feature. | +| The feature bit will not be cleared from the superblock until the log | +| becomes clean. | +| | +| Log-assisted extended attribute updates and atomic extent swaps both use | +| log incompat features and provide convenience wrappers around the | +| functionality. | ++--------------------------------------------------------------------------+ + +Mechanics of an Atomic Extent Swap +`````````````````````````````````` + +Swapping entire file forks is a complex task. +The goal is to exchange all file fork mappings between two file fork offset +ranges. +There are likely to be many extent mappings in each fork, and the edges of +the mappings aren't necessarily aligned. +Furthermore, there may be other updates that need to happen after the swap, +such as exchanging file sizes, inode flags, or conversion of fork data to local +format. +This is roughly the format of the new deferred extent swap work item: + +.. code-block:: c + + struct xfs_swapext_intent { + /* Inodes participating in the operation. */ + struct xfs_inode *sxi_ip1; + struct xfs_inode *sxi_ip2; + + /* File offset range information. */ + xfs_fileoff_t sxi_startoff1; + xfs_fileoff_t sxi_startoff2; + xfs_filblks_t sxi_blockcount; + + /* Set these file sizes after the operation, unless negative. */ + xfs_fsize_t sxi_isize1; + xfs_fsize_t sxi_isize2; + + /* XFS_SWAP_EXT_* log operation flags */ + uint64_t sxi_flags; + }; + +The new log intent item contains enough information to track two logical fork +offset ranges: ``(inode1, startoff1, blockcount)`` and ``(inode2, startoff2, +blockcount)``. +Each step of a swap operation exchanges the largest file range mapping possible +from one file to the other. +After each step in the swap operation, the two startoff fields are incremented +and the blockcount field is decremented to reflect the progress made. +The flags field captures behavioral parameters such as swapping the attr fork +instead of the data fork and other work to be done after the extent swap. +The two isize fields are used to swap the file size at the end of the operation +if the file data fork is the target of the swap operation. + +When the extent swap is initiated, the sequence of operations is as follows: + +1. Create a deferred work item for the extent swap. + At the start, it should contain the entirety of the file ranges to be + swapped. + +2. Call ``xfs_defer_finish`` to process the exchange. + This is encapsulated in ``xrep_tempswap_contents`` for scrub operations. + This will log an extent swap intent item to the transaction for the deferred + extent swap work item. + +3. Until ``sxi_blockcount`` of the deferred extent swap work item is zero, + + a. Read the block maps of both file ranges starting at ``sxi_startoff1`` and + ``sxi_startoff2``, respectively, and compute the longest extent that can + be swapped in a single step. + This is the minimum of the two ``br_blockcount`` s in the mappings. + Keep advancing through the file forks until at least one of the mappings + contains written blocks. + Mutual holes, unwritten extents, and extent mappings to the same physical + space are not exchanged. + + For the next few steps, this document will refer to the mapping that came + from file 1 as "map1", and the mapping that came from file 2 as "map2". + + b. Create a deferred block mapping update to unmap map1 from file 1. + + c. Create a deferred block mapping update to unmap map2 from file 2. + + d. Create a deferred block mapping update to map map1 into file 2. + + e. Create a deferred block mapping update to map map2 into file 1. + + f. Log the block, quota, and extent count updates for both files. + + g. Extend the ondisk size of either file if necessary. + + h. Log an extent swap done log item for the extent swap intent log item + that was read at the start of step 3. + + i. Compute the amount of file range that has just been covered. + This quantity is ``(map1.br_startoff + map1.br_blockcount - + sxi_startoff1)``, because step 3a could have skipped holes. + + j. Increase the starting offsets of ``sxi_startoff1`` and ``sxi_startoff2`` + by the number of blocks computed in the previous step, and decrease + ``sxi_blockcount`` by the same quantity. + This advances the cursor. + + k. Log a new extent swap intent log item reflecting the advanced state of + the work item. + + l. Return the proper error code (EAGAIN) to the deferred operation manager + to inform it that there is more work to be done. + The operation manager completes the deferred work in steps 3b-3e before + moving back to the start of step 3. + +4. Perform any post-processing. + This will be discussed in more detail in subsequent sections. + +If the filesystem goes down in the middle of an operation, log recovery will +find the most recent unfinished extent swap log intent item and restart from +there. +This is how extent swapping guarantees that an outside observer will either see +the old broken structure or the new one, and never a mismash of both. + +Preparation for Extent Swapping +``````````````````````````````` + +There are a few things that need to be taken care of before initiating an +atomic extent swap operation. +First, regular files require the page cache to be flushed to disk before the +operation begins, and directio writes to be quiesced. +Like any filesystem operation, extent swapping must determine the maximum +amount of disk space and quota that can be consumed on behalf of both files in +the operation, and reserve that quantity of resources to avoid an unrecoverable +out of space failure once it starts dirtying metadata. +The preparation step scans the ranges of both files to estimate: + +- Data device blocks needed to handle the repeated updates to the fork + mappings. +- Change in data and realtime block counts for both files. +- Increase in quota usage for both files, if the two files do not share the + same set of quota ids. +- The number of extent mappings that will be added to each file. +- Whether or not there are partially written realtime extents. + User programs must never be able to access a realtime file extent that maps + to different extents on the realtime volume, which could happen if the + operation fails to run to completion. + +The need for precise estimation increases the run time of the swap operation, +but it is very important to maintain correct accounting. +The filesystem must not run completely out of free space, nor can the extent +swap ever add more extent mappings to a fork than it can support. +Regular users are required to abide the quota limits, though metadata repairs +may exceed quota to resolve inconsistent metadata elsewhere. + +Special Features for Swapping Metadata File Extents +``````````````````````````````````````````````````` + +Extended attributes, symbolic links, and directories can set the fork format to +"local" and treat the fork as a literal area for data storage. +Metadata repairs must take extra steps to support these cases: + +- If both forks are in local format and the fork areas are large enough, the + swap is performed by copying the incore fork contents, logging both forks, + and committing. + The atomic extent swap mechanism is not necessary, since this can be done + with a single transaction. + +- If both forks map blocks, then the regular atomic extent swap is used. + +- Otherwise, only one fork is in local format. + The contents of the local format fork are converted to a block to perform the + swap. + The conversion to block format must be done in the same transaction that + logs the initial extent swap intent log item. + The regular atomic extent swap is used to exchange the mappings. + Special flags are set on the swap operation so that the transaction can be + rolled one more time to convert the second file's fork back to local format + so that the second file will be ready to go as soon as the ILOCK is dropped. + +Extended attributes and directories stamp the owning inode into every block, +but the buffer verifiers do not actually check the inode number! +Although there is no verification, it is still important to maintain +referential integrity, so prior to performing the extent swap, online repair +builds every block in the new data structure with the owner field of the file +being repaired. + +After a successful swap operation, the repair operation must reap the old fork +blocks by processing each fork mapping through the standard :ref:`file extent +reaping <reaping>` mechanism that is done post-repair. +If the filesystem should go down during the reap part of the repair, the +iunlink processing at the end of recovery will free both the temporary file and +whatever blocks were not reaped. +However, this iunlink processing omits the cross-link detection of online +repair, and is not completely foolproof. + +Swapping Temporary File Extents +``````````````````````````````` + +To repair a metadata file, online repair proceeds as follows: + +1. Create a temporary repair file. + +2. Use the staging data to write out new contents into the temporary repair + file. + The same fork must be written to as is being repaired. + +3. Commit the scrub transaction, since the swap estimation step must be + completed before transaction reservations are made. + +4. Call ``xrep_tempswap_trans_alloc`` to allocate a new scrub transaction with + the appropriate resource reservations, locks, and fill out a ``struct + xfs_swapext_req`` with the details of the swap operation. + +5. Call ``xrep_tempswap_contents`` to swap the contents. + +6. Commit the transaction to complete the repair. + +.. _rtsummary: + +Case Study: Repairing the Realtime Summary File +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +In the "realtime" section of an XFS filesystem, free space is tracked via a +bitmap, similar to Unix FFS. +Each bit in the bitmap represents one realtime extent, which is a multiple of +the filesystem block size between 4KiB and 1GiB in size. +The realtime summary file indexes the number of free extents of a given size to +the offset of the block within the realtime free space bitmap where those free +extents begin. +In other words, the summary file helps the allocator find free extents by +length, similar to what the free space by count (cntbt) btree does for the data +section. + +The summary file itself is a flat file (with no block headers or checksums!) +partitioned into ``log2(total rt extents)`` sections containing enough 32-bit +counters to match the number of blocks in the rt bitmap. +Each counter records the number of free extents that start in that bitmap block +and can satisfy a power-of-two allocation request. + +To check the summary file against the bitmap: + +1. Take the ILOCK of both the realtime bitmap and summary files. + +2. For each free space extent recorded in the bitmap: + + a. Compute the position in the summary file that contains a counter that + represents this free extent. + + b. Read the counter from the xfile. + + c. Increment it, and write it back to the xfile. + +3. Compare the contents of the xfile against the ondisk file. + +To repair the summary file, write the xfile contents into the temporary file +and use atomic extent swap to commit the new contents. +The temporary file is then reaped. + +The proposed patchset is the +`realtime summary repair +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rtsummary>`_ +series. + +Case Study: Salvaging Extended Attributes +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +In XFS, extended attributes are implemented as a namespaced name-value store. +Values are limited in size to 64KiB, but there is no limit in the number of +names. +The attribute fork is unpartitioned, which means that the root of the attribute +structure is always in logical block zero, but attribute leaf blocks, dabtree +index blocks, and remote value blocks are intermixed. +Attribute leaf blocks contain variable-sized records that associate +user-provided names with the user-provided values. +Values larger than a block are allocated separate extents and written there. +If the leaf information expands beyond a single block, a directory/attribute +btree (``dabtree``) is created to map hashes of attribute names to entries +for fast lookup. + +Salvaging extended attributes is done as follows: + +1. Walk the attr fork mappings of the file being repaired to find the attribute + leaf blocks. + When one is found, + + a. Walk the attr leaf block to find candidate keys. + When one is found, + + 1. Check the name for problems, and ignore the name if there are. + + 2. Retrieve the value. + If that succeeds, add the name and value to the staging xfarray and + xfblob. + +2. If the memory usage of the xfarray and xfblob exceed a certain amount of + memory or there are no more attr fork blocks to examine, unlock the file and + add the staged extended attributes to the temporary file. + +3. Use atomic extent swapping to exchange the new and old extended attribute + structures. + The old attribute blocks are now attached to the temporary file. + +4. Reap the temporary file. + +The proposed patchset is the +`extended attribute repair +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_ +series. + +Fixing Directories +------------------ + +Fixing directories is difficult with currently available filesystem features, +since directory entries are not redundant. +The offline repair tool scans all inodes to find files with nonzero link count, +and then it scans all directories to establish parentage of those linked files. +Damaged files and directories are zapped, and files with no parent are +moved to the ``/lost+found`` directory. +It does not try to salvage anything. + +The best that online repair can do at this time is to read directory data +blocks and salvage any dirents that look plausible, correct link counts, and +move orphans back into the directory tree. +The salvage process is discussed in the case study at the end of this section. +The :ref:`file link count fsck <nlinks>` code takes care of fixing link counts +and moving orphans to the ``/lost+found`` directory. + +Case Study: Salvaging Directories +````````````````````````````````` + +Unlike extended attributes, directory blocks are all the same size, so +salvaging directories is straightforward: + +1. Find the parent of the directory. + If the dotdot entry is not unreadable, try to confirm that the alleged + parent has a child entry pointing back to the directory being repaired. + Otherwise, walk the filesystem to find it. + +2. Walk the first partition of data fork of the directory to find the directory + entry data blocks. + When one is found, + + a. Walk the directory data block to find candidate entries. + When an entry is found: + + i. Check the name for problems, and ignore the name if there are. + + ii. Retrieve the inumber and grab the inode. + If that succeeds, add the name, inode number, and file type to the + staging xfarray and xblob. + +3. If the memory usage of the xfarray and xfblob exceed a certain amount of + memory or there are no more directory data blocks to examine, unlock the + directory and add the staged dirents into the temporary directory. + Truncate the staging files. + +4. Use atomic extent swapping to exchange the new and old directory structures. + The old directory blocks are now attached to the temporary file. + +5. Reap the temporary file. + +**Future Work Question**: Should repair revalidate the dentry cache when +rebuilding a directory? + +*Answer*: Yes, it should. + +In theory it is necessary to scan all dentry cache entries for a directory to +ensure that one of the following apply: + +1. The cached dentry reflects an ondisk dirent in the new directory. + +2. The cached dentry no longer has a corresponding ondisk dirent in the new + directory and the dentry can be purged from the cache. + +3. The cached dentry no longer has an ondisk dirent but the dentry cannot be + purged. + This is the problem case. + +Unfortunately, the current dentry cache design doesn't provide a means to walk +every child dentry of a specific directory, which makes this a hard problem. +There is no known solution. + +The proposed patchset is the +`directory repair +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_ +series. + +Parent Pointers +``````````````` + +A parent pointer is a piece of file metadata that enables a user to locate the +file's parent directory without having to traverse the directory tree from the +root. +Without them, reconstruction of directory trees is hindered in much the same +way that the historic lack of reverse space mapping information once hindered +reconstruction of filesystem space metadata. +The parent pointer feature, however, makes total directory reconstruction +possible. + +XFS parent pointers include the dirent name and location of the entry within +the parent directory. +In other words, child files use extended attributes to store pointers to +parents in the form ``(parent_inum, parent_gen, dirent_pos) → (dirent_name)``. +The directory checking process can be strengthened to ensure that the target of +each dirent also contains a parent pointer pointing back to the dirent. +Likewise, each parent pointer can be checked by ensuring that the target of +each parent pointer is a directory and that it contains a dirent matching +the parent pointer. +Both online and offline repair can use this strategy. + +**Note**: The ondisk format of parent pointers is not yet finalized. + ++--------------------------------------------------------------------------+ +| **Historical Sidebar**: | ++--------------------------------------------------------------------------+ +| Directory parent pointers were first proposed as an XFS feature more | +| than a decade ago by SGI. | +| Each link from a parent directory to a child file is mirrored with an | +| extended attribute in the child that could be used to identify the | +| parent directory. | +| Unfortunately, this early implementation had major shortcomings and was | +| never merged into Linux XFS: | +| | +| 1. The XFS codebase of the late 2000s did not have the infrastructure to | +| enforce strong referential integrity in the directory tree. | +| It did not guarantee that a change in a forward link would always be | +| followed up with the corresponding change to the reverse links. | +| | +| 2. Referential integrity was not integrated into offline repair. | +| Checking and repairs were performed on mounted filesystems without | +| taking any kernel or inode locks to coordinate access. | +| It is not clear how this actually worked properly. | +| | +| 3. The extended attribute did not record the name of the directory entry | +| in the parent, so the SGI parent pointer implementation cannot be | +| used to reconnect the directory tree. | +| | +| 4. Extended attribute forks only support 65,536 extents, which means | +| that parent pointer attribute creation is likely to fail at some | +| point before the maximum file link count is achieved. | +| | +| The original parent pointer design was too unstable for something like | +| a file system repair to depend on. | +| Allison Henderson, Chandan Babu, and Catherine Hoang are working on a | +| second implementation that solves all shortcomings of the first. | +| During 2022, Allison introduced log intent items to track physical | +| manipulations of the extended attribute structures. | +| This solves the referential integrity problem by making it possible to | +| commit a dirent update and a parent pointer update in the same | +| transaction. | +| Chandan increased the maximum extent counts of both data and attribute | +| forks, thereby ensuring that the extended attribute structure can grow | +| to handle the maximum hardlink count of any file. | ++--------------------------------------------------------------------------+ + +Case Study: Repairing Directories with Parent Pointers +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Directory rebuilding uses a :ref:`coordinated inode scan <iscan>` and +a :ref:`directory entry live update hook <liveupdate>` as follows: + +1. Set up a temporary directory for generating the new directory structure, + an xfblob for storing entry names, and an xfarray for stashing directory + updates. + +2. Set up an inode scanner and hook into the directory entry code to receive + updates on directory operations. + +3. For each parent pointer found in each file scanned, decide if the parent + pointer references the directory of interest. + If so: + + a. Stash an addname entry for this dirent in the xfarray for later. + + b. When finished scanning that file, flush the stashed updates to the + temporary directory. + +4. For each live directory update received via the hook, decide if the child + has already been scanned. + If so: + + a. Stash an addname or removename entry for this dirent update in the + xfarray for later. + We cannot write directly to the temporary directory because hook + functions are not allowed to modify filesystem metadata. + Instead, we stash updates in the xfarray and rely on the scanner thread + to apply the stashed updates to the temporary directory. + +5. When the scan is complete, atomically swap the contents of the temporary + directory and the directory being repaired. + The temporary directory now contains the damaged directory structure. + +6. Reap the temporary directory. + +7. Update the dirent position field of parent pointers as necessary. + This may require the queuing of a substantial number of xattr log intent + items. + +The proposed patchset is the +`parent pointers directory repair +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-online-dir-repair>`_ +series. + +**Unresolved Question**: How will repair ensure that the ``dirent_pos`` fields +match in the reconstructed directory? + +*Answer*: There are a few ways to solve this problem: + +1. The field could be designated advisory, since the other three values are + sufficient to find the entry in the parent. + However, this makes indexed key lookup impossible while repairs are ongoing. + +2. We could allow creating directory entries at specified offsets, which solves + the referential integrity problem but runs the risk that dirent creation + will fail due to conflicts with the free space in the directory. + + These conflicts could be resolved by appending the directory entry and + amending the xattr code to support updating an xattr key and reindexing the + dabtree, though this would have to be performed with the parent directory + still locked. + +3. Same as above, but remove the old parent pointer entry and add a new one + atomically. + +4. Change the ondisk xattr format to ``(parent_inum, name) → (parent_gen)``, + which would provide the attr name uniqueness that we require, without + forcing repair code to update the dirent position. + Unfortunately, this requires changes to the xattr code to support attr + names as long as 263 bytes. + +5. Change the ondisk xattr format to ``(parent_inum, hash(name)) → + (name, parent_gen)``. + If the hash is sufficiently resistant to collisions (e.g. sha256) then + this should provide the attr name uniqueness that we require. + Names shorter than 247 bytes could be stored directly. + +Discussion is ongoing under the `parent pointers patch deluge +<https://www.spinics.net/lists/linux-xfs/msg69397.html>`_. + +Case Study: Repairing Parent Pointers +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Online reconstruction of a file's parent pointer information works similarly to +directory reconstruction: + +1. Set up a temporary file for generating a new extended attribute structure, + an `xfblob<xfblob>` for storing parent pointer names, and an xfarray for + stashing parent pointer updates. + +2. Set up an inode scanner and hook into the directory entry code to receive + updates on directory operations. + +3. For each directory entry found in each directory scanned, decide if the + dirent references the file of interest. + If so: + + a. Stash an addpptr entry for this parent pointer in the xfblob and xfarray + for later. + + b. When finished scanning the directory, flush the stashed updates to the + temporary directory. + +4. For each live directory update received via the hook, decide if the parent + has already been scanned. + If so: + + a. Stash an addpptr or removepptr entry for this dirent update in the + xfarray for later. + We cannot write parent pointers directly to the temporary file because + hook functions are not allowed to modify filesystem metadata. + Instead, we stash updates in the xfarray and rely on the scanner thread + to apply the stashed parent pointer updates to the temporary file. + +5. Copy all non-parent pointer extended attributes to the temporary file. + +6. When the scan is complete, atomically swap the attribute fork of the + temporary file and the file being repaired. + The temporary file now contains the damaged extended attribute structure. + +7. Reap the temporary file. + +The proposed patchset is the +`parent pointers repair +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-online-parent-repair>`_ +series. + +Digression: Offline Checking of Parent Pointers +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Examining parent pointers in offline repair works differently because corrupt +files are erased long before directory tree connectivity checks are performed. +Parent pointer checks are therefore a second pass to be added to the existing +connectivity checks: + +1. After the set of surviving files has been established (i.e. phase 6), + walk the surviving directories of each AG in the filesystem. + This is already performed as part of the connectivity checks. + +2. For each directory entry found, record the name in an xfblob, and store + ``(child_ag_inum, parent_inum, parent_gen, dirent_pos)`` tuples in a + per-AG in-memory slab. + +3. For each AG in the filesystem, + + a. Sort the per-AG tuples in order of child_ag_inum, parent_inum, and + dirent_pos. + + b. For each inode in the AG, + + 1. Scan the inode for parent pointers. + Record the names in a per-file xfblob, and store ``(parent_inum, + parent_gen, dirent_pos)`` tuples in a per-file slab. + + 2. Sort the per-file tuples in order of parent_inum, and dirent_pos. + + 3. Position one slab cursor at the start of the inode's records in the + per-AG tuple slab. + This should be trivial since the per-AG tuples are in child inumber + order. + + 4. Position a second slab cursor at the start of the per-file tuple slab. + + 5. Iterate the two cursors in lockstep, comparing the parent_ino and + dirent_pos fields of the records under each cursor. + + a. Tuples in the per-AG list but not the per-file list are missing and + need to be written to the inode. + + b. Tuples in the per-file list but not the per-AG list are dangling + and need to be removed from the inode. + + c. For tuples in both lists, update the parent_gen and name components + of the parent pointer if necessary. + +4. Move on to examining link counts, as we do today. + +The proposed patchset is the +`offline parent pointers repair +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=pptrs-repair>`_ +series. + +Rebuilding directories from parent pointers in offline repair is very +challenging because it currently uses a single-pass scan of the filesystem +during phase 3 to decide which files are corrupt enough to be zapped. +This scan would have to be converted into a multi-pass scan: + +1. The first pass of the scan zaps corrupt inodes, forks, and attributes + much as it does now. + Corrupt directories are noted but not zapped. + +2. The next pass records parent pointers pointing to the directories noted + as being corrupt in the first pass. + This second pass may have to happen after the phase 4 scan for duplicate + blocks, if phase 4 is also capable of zapping directories. + +3. The third pass resets corrupt directories to an empty shortform directory. + Free space metadata has not been ensured yet, so repair cannot yet use the + directory building code in libxfs. + +4. At the start of phase 6, space metadata have been rebuilt. + Use the parent pointer information recorded during step 2 to reconstruct + the dirents and add them to the now-empty directories. + +This code has not yet been constructed. + +.. _orphanage: + +The Orphanage +------------- + +Filesystems present files as a directed, and hopefully acyclic, graph. +In other words, a tree. +The root of the filesystem is a directory, and each entry in a directory points +downwards either to more subdirectories or to non-directory files. +Unfortunately, a disruption in the directory graph pointers result in a +disconnected graph, which makes files impossible to access via regular path +resolution. + +Without parent pointers, the directory parent pointer online scrub code can +detect a dotdot entry pointing to a parent directory that doesn't have a link +back to the child directory and the file link count checker can detect a file +that isn't pointed to by any directory in the filesystem. +If such a file has a positive link count, the file is an orphan. + +With parent pointers, directories can be rebuilt by scanning parent pointers +and parent pointers can be rebuilt by scanning directories. +This should reduce the incidence of files ending up in ``/lost+found``. + +When orphans are found, they should be reconnected to the directory tree. +Offline fsck solves the problem by creating a directory ``/lost+found`` to +serve as an orphanage, and linking orphan files into the orphanage by using the +inumber as the name. +Reparenting a file to the orphanage does not reset any of its permissions or +ACLs. + +This process is more involved in the kernel than it is in userspace. +The directory and file link count repair setup functions must use the regular +VFS mechanisms to create the orphanage directory with all the necessary +security attributes and dentry cache entries, just like a regular directory +tree modification. + +Orphaned files are adopted by the orphanage as follows: + +1. Call ``xrep_orphanage_try_create`` at the start of the scrub setup function + to try to ensure that the lost and found directory actually exists. + This also attaches the orphanage directory to the scrub context. + +2. If the decision is made to reconnect a file, take the IOLOCK of both the + orphanage and the file being reattached. + The ``xrep_orphanage_iolock_two`` function follows the inode locking + strategy discussed earlier. + +3. Call ``xrep_orphanage_compute_blkres`` and ``xrep_orphanage_compute_name`` + to compute the new name in the orphanage and the block reservation required. + +4. Use ``xrep_orphanage_adoption_prep`` to reserve resources to the repair + transaction. + +5. Call ``xrep_orphanage_adopt`` to reparent the orphaned file into the lost + and found, and update the kernel dentry cache. + +The proposed patches are in the +`orphanage adoption +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-orphanage>`_ +series. + +6. Userspace Algorithms and Data Structures +=========================================== + +This section discusses the key algorithms and data structures of the userspace +program, ``xfs_scrub``, that provide the ability to drive metadata checks and +repairs in the kernel, verify file data, and look for other potential problems. + +.. _scrubcheck: + +Checking Metadata +----------------- + +Recall the :ref:`phases of fsck work<scrubphases>` outlined earlier. +That structure follows naturally from the data dependencies designed into the +filesystem from its beginnings in 1993. +In XFS, there are several groups of metadata dependencies: + +a. Filesystem summary counts depend on consistency within the inode indices, + the allocation group space btrees, and the realtime volume space + information. + +b. Quota resource counts depend on consistency within the quota file data + forks, inode indices, inode records, and the forks of every file on the + system. + +c. The naming hierarchy depends on consistency within the directory and + extended attribute structures. + This includes file link counts. + +d. Directories, extended attributes, and file data depend on consistency within + the file forks that map directory and extended attribute data to physical + storage media. + +e. The file forks depends on consistency within inode records and the space + metadata indices of the allocation groups and the realtime volume. + This includes quota and realtime metadata files. + +f. Inode records depends on consistency within the inode metadata indices. + +g. Realtime space metadata depend on the inode records and data forks of the + realtime metadata inodes. + +h. The allocation group metadata indices (free space, inodes, reference count, + and reverse mapping btrees) depend on consistency within the AG headers and + between all the AG metadata btrees. + +i. ``xfs_scrub`` depends on the filesystem being mounted and kernel support + for online fsck functionality. + +Therefore, a metadata dependency graph is a convenient way to schedule checking +operations in the ``xfs_scrub`` program: + +- Phase 1 checks that the provided path maps to an XFS filesystem and detect + the kernel's scrubbing abilities, which validates group (i). + +- Phase 2 scrubs groups (g) and (h) in parallel using a threaded workqueue. + +- Phase 3 scans inodes in parallel. + For each inode, groups (f), (e), and (d) are checked, in that order. + +- Phase 4 repairs everything in groups (i) through (d) so that phases 5 and 6 + may run reliably. + +- Phase 5 starts by checking groups (b) and (c) in parallel before moving on + to checking names. + +- Phase 6 depends on groups (i) through (b) to find file data blocks to verify, + to read them, and to report which blocks of which files are affected. + +- Phase 7 checks group (a), having validated everything else. + +Notice that the data dependencies between groups are enforced by the structure +of the program flow. + +Parallel Inode Scans +-------------------- + +An XFS filesystem can easily contain hundreds of millions of inodes. +Given that XFS targets installations with large high-performance storage, +it is desirable to scrub inodes in parallel to minimize runtime, particularly +if the program has been invoked manually from a command line. +This requires careful scheduling to keep the threads as evenly loaded as +possible. + +Early iterations of the ``xfs_scrub`` inode scanner naïvely created a single +workqueue and scheduled a single workqueue item per AG. +Each workqueue item walked the inode btree (with ``XFS_IOC_INUMBERS``) to find +inode chunks and then called bulkstat (``XFS_IOC_BULKSTAT``) to gather enough +information to construct file handles. +The file handle was then passed to a function to generate scrub items for each +metadata object of each inode. +This simple algorithm leads to thread balancing problems in phase 3 if the +filesystem contains one AG with a few large sparse files and the rest of the +AGs contain many smaller files. +The inode scan dispatch function was not sufficiently granular; it should have +been dispatching at the level of individual inodes, or, to constrain memory +consumption, inode btree records. + +Thanks to Dave Chinner, bounded workqueues in userspace enable ``xfs_scrub`` to +avoid this problem with ease by adding a second workqueue. +Just like before, the first workqueue is seeded with one workqueue item per AG, +and it uses INUMBERS to find inode btree chunks. +The second workqueue, however, is configured with an upper bound on the number +of items that can be waiting to be run. +Each inode btree chunk found by the first workqueue's workers are queued to the +second workqueue, and it is this second workqueue that queries BULKSTAT, +creates a file handle, and passes it to a function to generate scrub items for +each metadata object of each inode. +If the second workqueue is too full, the workqueue add function blocks the +first workqueue's workers until the backlog eases. +This doesn't completely solve the balancing problem, but reduces it enough to +move on to more pressing issues. + +The proposed patchsets are the scrub +`performance tweaks +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-performance-tweaks>`_ +and the +`inode scan rebalance +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-iscan-rebalance>`_ +series. + +.. _scrubrepair: + +Scheduling Repairs +------------------ + +During phase 2, corruptions and inconsistencies reported in any AGI header or +inode btree are repaired immediately, because phase 3 relies on proper +functioning of the inode indices to find inodes to scan. +Failed repairs are rescheduled to phase 4. +Problems reported in any other space metadata are deferred to phase 4. +Optimization opportunities are always deferred to phase 4, no matter their +origin. + +During phase 3, corruptions and inconsistencies reported in any part of a +file's metadata are repaired immediately if all space metadata were validated +during phase 2. +Repairs that fail or cannot be repaired immediately are scheduled for phase 4. + +In the original design of ``xfs_scrub``, it was thought that repairs would be +so infrequent that the ``struct xfs_scrub_metadata`` objects used to +communicate with the kernel could also be used as the primary object to +schedule repairs. +With recent increases in the number of optimizations possible for a given +filesystem object, it became much more memory-efficient to track all eligible +repairs for a given filesystem object with a single repair item. +Each repair item represents a single lockable object -- AGs, metadata files, +individual inodes, or a class of summary information. + +Phase 4 is responsible for scheduling a lot of repair work in as quick a +manner as is practical. +The :ref:`data dependencies <scrubcheck>` outlined earlier still apply, which +means that ``xfs_scrub`` must try to complete the repair work scheduled by +phase 2 before trying repair work scheduled by phase 3. +The repair process is as follows: + +1. Start a round of repair with a workqueue and enough workers to keep the CPUs + as busy as the user desires. + + a. For each repair item queued by phase 2, + + i. Ask the kernel to repair everything listed in the repair item for a + given filesystem object. + + ii. Make a note if the kernel made any progress in reducing the number + of repairs needed for this object. + + iii. If the object no longer requires repairs, revalidate all metadata + associated with this object. + If the revalidation succeeds, drop the repair item. + If not, requeue the item for more repairs. + + b. If any repairs were made, jump back to 1a to retry all the phase 2 items. + + c. For each repair item queued by phase 3, + + i. Ask the kernel to repair everything listed in the repair item for a + given filesystem object. + + ii. Make a note if the kernel made any progress in reducing the number + of repairs needed for this object. + + iii. If the object no longer requires repairs, revalidate all metadata + associated with this object. + If the revalidation succeeds, drop the repair item. + If not, requeue the item for more repairs. + + d. If any repairs were made, jump back to 1c to retry all the phase 3 items. + +2. If step 1 made any repair progress of any kind, jump back to step 1 to start + another round of repair. + +3. If there are items left to repair, run them all serially one more time. + Complain if the repairs were not successful, since this is the last chance + to repair anything. + +Corruptions and inconsistencies encountered during phases 5 and 7 are repaired +immediately. +Corrupt file data blocks reported by phase 6 cannot be recovered by the +filesystem. + +The proposed patchsets are the +`repair warning improvements +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-better-repair-warnings>`_, +refactoring of the +`repair data dependency +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-data-deps>`_ +and +`object tracking +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-object-tracking>`_, +and the +`repair scheduling +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-scheduling>`_ +improvement series. + +Checking Names for Confusable Unicode Sequences +----------------------------------------------- + +If ``xfs_scrub`` succeeds in validating the filesystem metadata by the end of +phase 4, it moves on to phase 5, which checks for suspicious looking names in +the filesystem. +These names consist of the filesystem label, names in directory entries, and +the names of extended attributes. +Like most Unix filesystems, XFS imposes the sparest of constraints on the +contents of a name: + +- Slashes and null bytes are not allowed in directory entries. + +- Null bytes are not allowed in userspace-visible extended attributes. + +- Null bytes are not allowed in the filesystem label. + +Directory entries and attribute keys store the length of the name explicitly +ondisk, which means that nulls are not name terminators. +For this section, the term "naming domain" refers to any place where names are +presented together -- all the names in a directory, or all the attributes of a +file. + +Although the Unix naming constraints are very permissive, the reality of most +modern-day Linux systems is that programs work with Unicode character code +points to support international languages. +These programs typically encode those code points in UTF-8 when interfacing +with the C library because the kernel expects null-terminated names. +In the common case, therefore, names found in an XFS filesystem are actually +UTF-8 encoded Unicode data. + +To maximize its expressiveness, the Unicode standard defines separate control +points for various characters that render similarly or identically in writing +systems around the world. +For example, the character "Cyrillic Small Letter A" U+0430 "а" often renders +identically to "Latin Small Letter A" U+0061 "a". + +The standard also permits characters to be constructed in multiple ways -- +either by using a defined code point, or by combining one code point with +various combining marks. +For example, the character "Angstrom Sign U+212B "Å" can also be expressed +as "Latin Capital Letter A" U+0041 "A" followed by "Combining Ring Above" +U+030A "◌̊". +Both sequences render identically. + +Like the standards that preceded it, Unicode also defines various control +characters to alter the presentation of text. +For example, the character "Right-to-Left Override" U+202E can trick some +programs into rendering "moo\\xe2\\x80\\xaegnp.txt" as "mootxt.png". +A second category of rendering problems involves whitespace characters. +If the character "Zero Width Space" U+200B is encountered in a file name, the +name will render identically to a name that does not have the zero width +space. + +If two names within a naming domain have different byte sequences but render +identically, a user may be confused by it. +The kernel, in its indifference to upper level encoding schemes, permits this. +Most filesystem drivers persist the byte sequence names that are given to them +by the VFS. + +Techniques for detecting confusable names are explained in great detail in +sections 4 and 5 of the +`Unicode Security Mechanisms <https://unicode.org/reports/tr39/>`_ +document. +When ``xfs_scrub`` detects UTF-8 encoding in use on a system, it uses the +Unicode normalization form NFD in conjunction with the confusable name +detection component of +`libicu <https://github.com/unicode-org/icu>`_ +to identify names with a directory or within a file's extended attributes that +could be confused for each other. +Names are also checked for control characters, non-rendering characters, and +mixing of bidirectional characters. +All of these potential issues are reported to the system administrator during +phase 5. + +Media Verification of File Data Extents +--------------------------------------- + +The system administrator can elect to initiate a media scan of all file data +blocks. +This scan after validation of all filesystem metadata (except for the summary +counters) as phase 6. +The scan starts by calling ``FS_IOC_GETFSMAP`` to scan the filesystem space map +to find areas that are allocated to file data fork extents. +Gaps between data fork extents that are smaller than 64k are treated as if +they were data fork extents to reduce the command setup overhead. +When the space map scan accumulates a region larger than 32MB, a media +verification request is sent to the disk as a directio read of the raw block +device. + +If the verification read fails, ``xfs_scrub`` retries with single-block reads +to narrow down the failure to the specific region of the media and recorded. +When it has finished issuing verification requests, it again uses the space +mapping ioctl to map the recorded media errors back to metadata structures +and report what has been lost. +For media errors in blocks owned by files, parent pointers can be used to +construct file paths from inode numbers for user-friendly reporting. + +7. Conclusion and Future Work +============================= + +It is hoped that the reader of this document has followed the designs laid out +in this document and now has some familiarity with how XFS performs online +rebuilding of its metadata indices, and how filesystem users can interact with +that functionality. +Although the scope of this work is daunting, it is hoped that this guide will +make it easier for code readers to understand what has been built, for whom it +has been built, and why. +Please feel free to contact the XFS mailing list with questions. + +FIEXCHANGE_RANGE +---------------- + +As discussed earlier, a second frontend to the atomic extent swap mechanism is +a new ioctl call that userspace programs can use to commit updates to files +atomically. +This frontend has been out for review for several years now, though the +necessary refinements to online repair and lack of customer demand mean that +the proposal has not been pushed very hard. + +Extent Swapping with Regular User Files +``````````````````````````````````````` + +As mentioned earlier, XFS has long had the ability to swap extents between +files, which is used almost exclusively by ``xfs_fsr`` to defragment files. +The earliest form of this was the fork swap mechanism, where the entire +contents of data forks could be exchanged between two files by exchanging the +raw bytes in each inode fork's immediate area. +When XFS v5 came along with self-describing metadata, this old mechanism grew +some log support to continue rewriting the owner fields of BMBT blocks during +log recovery. +When the reverse mapping btree was later added to XFS, the only way to maintain +the consistency of the fork mappings with the reverse mapping index was to +develop an iterative mechanism that used deferred bmap and rmap operations to +swap mappings one at a time. +This mechanism is identical to steps 2-3 from the procedure above except for +the new tracking items, because the atomic extent swap mechanism is an +iteration of an existing mechanism and not something totally novel. +For the narrow case of file defragmentation, the file contents must be +identical, so the recovery guarantees are not much of a gain. + +Atomic extent swapping is much more flexible than the existing swapext +implementations because it can guarantee that the caller never sees a mix of +old and new contents even after a crash, and it can operate on two arbitrary +file fork ranges. +The extra flexibility enables several new use cases: + +- **Atomic commit of file writes**: A userspace process opens a file that it + wants to update. + Next, it opens a temporary file and calls the file clone operation to reflink + the first file's contents into the temporary file. + Writes to the original file should instead be written to the temporary file. + Finally, the process calls the atomic extent swap system call + (``FIEXCHANGE_RANGE``) to exchange the file contents, thereby committing all + of the updates to the original file, or none of them. + +.. _swapext_if_unchanged: + +- **Transactional file updates**: The same mechanism as above, but the caller + only wants the commit to occur if the original file's contents have not + changed. + To make this happen, the calling process snapshots the file modification and + change timestamps of the original file before reflinking its data to the + temporary file. + When the program is ready to commit the changes, it passes the timestamps + into the kernel as arguments to the atomic extent swap system call. + The kernel only commits the changes if the provided timestamps match the + original file. + +- **Emulation of atomic block device writes**: Export a block device with a + logical sector size matching the filesystem block size to force all writes + to be aligned to the filesystem block size. + Stage all writes to a temporary file, and when that is complete, call the + atomic extent swap system call with a flag to indicate that holes in the + temporary file should be ignored. + This emulates an atomic device write in software, and can support arbitrary + scattered writes. + +Vectorized Scrub +---------------- + +As it turns out, the :ref:`refactoring <scrubrepair>` of repair items mentioned +earlier was a catalyst for enabling a vectorized scrub system call. +Since 2018, the cost of making a kernel call has increased considerably on some +systems to mitigate the effects of speculative execution attacks. +This incentivizes program authors to make as few system calls as possible to +reduce the number of times an execution path crosses a security boundary. + +With vectorized scrub, userspace pushes to the kernel the identity of a +filesystem object, a list of scrub types to run against that object, and a +simple representation of the data dependencies between the selected scrub +types. +The kernel executes as much of the caller's plan as it can until it hits a +dependency that cannot be satisfied due to a corruption, and tells userspace +how much was accomplished. +It is hoped that ``io_uring`` will pick up enough of this functionality that +online fsck can use that instead of adding a separate vectored scrub system +call to XFS. + +The relevant patchsets are the +`kernel vectorized scrub +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=vectorized-scrub>`_ +and +`userspace vectorized scrub +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=vectorized-scrub>`_ +series. + +Quality of Service Targets for Scrub +------------------------------------ + +One serious shortcoming of the online fsck code is that the amount of time that +it can spend in the kernel holding resource locks is basically unbounded. +Userspace is allowed to send a fatal signal to the process which will cause +``xfs_scrub`` to exit when it reaches a good stopping point, but there's no way +for userspace to provide a time budget to the kernel. +Given that the scrub codebase has helpers to detect fatal signals, it shouldn't +be too much work to allow userspace to specify a timeout for a scrub/repair +operation and abort the operation if it exceeds budget. +However, most repair functions have the property that once they begin to touch +ondisk metadata, the operation cannot be cancelled cleanly, after which a QoS +timeout is no longer useful. + +Defragmenting Free Space +------------------------ + +Over the years, many XFS users have requested the creation of a program to +clear a portion of the physical storage underlying a filesystem so that it +becomes a contiguous chunk of free space. +Call this free space defragmenter ``clearspace`` for short. + +The first piece the ``clearspace`` program needs is the ability to read the +reverse mapping index from userspace. +This already exists in the form of the ``FS_IOC_GETFSMAP`` ioctl. +The second piece it needs is a new fallocate mode +(``FALLOC_FL_MAP_FREE_SPACE``) that allocates the free space in a region and +maps it to a file. +Call this file the "space collector" file. +The third piece is the ability to force an online repair. + +To clear all the metadata out of a portion of physical storage, clearspace +uses the new fallocate map-freespace call to map any free space in that region +to the space collector file. +Next, clearspace finds all metadata blocks in that region by way of +``GETFSMAP`` and issues forced repair requests on the data structure. +This often results in the metadata being rebuilt somewhere that is not being +cleared. +After each relocation, clearspace calls the "map free space" function again to +collect any newly freed space in the region being cleared. + +To clear all the file data out of a portion of the physical storage, clearspace +uses the FSMAP information to find relevant file data blocks. +Having identified a good target, it uses the ``FICLONERANGE`` call on that part +of the file to try to share the physical space with a dummy file. +Cloning the extent means that the original owners cannot overwrite the +contents; any changes will be written somewhere else via copy-on-write. +Clearspace makes its own copy of the frozen extent in an area that is not being +cleared, and uses ``FIEDEUPRANGE`` (or the :ref:`atomic extent swap +<swapext_if_unchanged>` feature) to change the target file's data extent +mapping away from the area being cleared. +When all other mappings have been moved, clearspace reflinks the space into the +space collector file so that it becomes unavailable. + +There are further optimizations that could apply to the above algorithm. +To clear a piece of physical storage that has a high sharing factor, it is +strongly desirable to retain this sharing factor. +In fact, these extents should be moved first to maximize sharing factor after +the operation completes. +To make this work smoothly, clearspace needs a new ioctl +(``FS_IOC_GETREFCOUNTS``) to report reference count information to userspace. +With the refcount information exposed, clearspace can quickly find the longest, +most shared data extents in the filesystem, and target them first. + +**Future Work Question**: How might the filesystem move inode chunks? + +*Answer*: To move inode chunks, Dave Chinner constructed a prototype program +that creates a new file with the old contents and then locklessly runs around +the filesystem updating directory entries. +The operation cannot complete if the filesystem goes down. +That problem isn't totally insurmountable: create an inode remapping table +hidden behind a jump label, and a log item that tracks the kernel walking the +filesystem to update directory entries. +The trouble is, the kernel can't do anything about open files, since it cannot +revoke them. + +**Future Work Question**: Can static keys be used to minimize the cost of +supporting ``revoke()`` on XFS files? + +*Answer*: Yes. +Until the first revocation, the bailout code need not be in the call path at +all. + +The relevant patchsets are the +`kernel freespace defrag +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=defrag-freespace>`_ +and +`userspace freespace defrag +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=defrag-freespace>`_ +series. + +Shrinking Filesystems +--------------------- + +Removing the end of the filesystem ought to be a simple matter of evacuating +the data and metadata at the end of the filesystem, and handing the freed space +to the shrink code. +That requires an evacuation of the space at end of the filesystem, which is a +use of free space defragmentation! diff --git a/Documentation/filesystems/xfs/xfs-self-describing-metadata.rst b/Documentation/filesystems/xfs/xfs-self-describing-metadata.rst new file mode 100644 index 000000000000..a10c4ae6955e --- /dev/null +++ b/Documentation/filesystems/xfs/xfs-self-describing-metadata.rst @@ -0,0 +1,353 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. _xfs_self_describing_metadata: + +============================ +XFS Self Describing Metadata +============================ + +Introduction +============ + +The largest scalability problem facing XFS is not one of algorithmic +scalability, but of verification of the filesystem structure. Scalabilty of the +structures and indexes on disk and the algorithms for iterating them are +adequate for supporting PB scale filesystems with billions of inodes, however it +is this very scalability that causes the verification problem. + +Almost all metadata on XFS is dynamically allocated. The only fixed location +metadata is the allocation group headers (SB, AGF, AGFL and AGI), while all +other metadata structures need to be discovered by walking the filesystem +structure in different ways. While this is already done by userspace tools for +validating and repairing the structure, there are limits to what they can +verify, and this in turn limits the supportable size of an XFS filesystem. + +For example, it is entirely possible to manually use xfs_db and a bit of +scripting to analyse the structure of a 100TB filesystem when trying to +determine the root cause of a corruption problem, but it is still mainly a +manual task of verifying that things like single bit errors or misplaced writes +weren't the ultimate cause of a corruption event. It may take a few hours to a +few days to perform such forensic analysis, so for at this scale root cause +analysis is entirely possible. + +However, if we scale the filesystem up to 1PB, we now have 10x as much metadata +to analyse and so that analysis blows out towards weeks/months of forensic work. +Most of the analysis work is slow and tedious, so as the amount of analysis goes +up, the more likely that the cause will be lost in the noise. Hence the primary +concern for supporting PB scale filesystems is minimising the time and effort +required for basic forensic analysis of the filesystem structure. + + +Self Describing Metadata +======================== + +One of the problems with the current metadata format is that apart from the +magic number in the metadata block, we have no other way of identifying what it +is supposed to be. We can't even identify if it is the right place. Put simply, +you can't look at a single metadata block in isolation and say "yes, it is +supposed to be there and the contents are valid". + +Hence most of the time spent on forensic analysis is spent doing basic +verification of metadata values, looking for values that are in range (and hence +not detected by automated verification checks) but are not correct. Finding and +understanding how things like cross linked block lists (e.g. sibling +pointers in a btree end up with loops in them) are the key to understanding what +went wrong, but it is impossible to tell what order the blocks were linked into +each other or written to disk after the fact. + +Hence we need to record more information into the metadata to allow us to +quickly determine if the metadata is intact and can be ignored for the purpose +of analysis. We can't protect against every possible type of error, but we can +ensure that common types of errors are easily detectable. Hence the concept of +self describing metadata. + +The first, fundamental requirement of self describing metadata is that the +metadata object contains some form of unique identifier in a well known +location. This allows us to identify the expected contents of the block and +hence parse and verify the metadata object. IF we can't independently identify +the type of metadata in the object, then the metadata doesn't describe itself +very well at all! + +Luckily, almost all XFS metadata has magic numbers embedded already - only the +AGFL, remote symlinks and remote attribute blocks do not contain identifying +magic numbers. Hence we can change the on-disk format of all these objects to +add more identifying information and detect this simply by changing the magic +numbers in the metadata objects. That is, if it has the current magic number, +the metadata isn't self identifying. If it contains a new magic number, it is +self identifying and we can do much more expansive automated verification of the +metadata object at runtime, during forensic analysis or repair. + +As a primary concern, self describing metadata needs some form of overall +integrity checking. We cannot trust the metadata if we cannot verify that it has +not been changed as a result of external influences. Hence we need some form of +integrity check, and this is done by adding CRC32c validation to the metadata +block. If we can verify the block contains the metadata it was intended to +contain, a large amount of the manual verification work can be skipped. + +CRC32c was selected as metadata cannot be more than 64k in length in XFS and +hence a 32 bit CRC is more than sufficient to detect multi-bit errors in +metadata blocks. CRC32c is also now hardware accelerated on common CPUs so it is +fast. So while CRC32c is not the strongest of possible integrity checks that +could be used, it is more than sufficient for our needs and has relatively +little overhead. Adding support for larger integrity fields and/or algorithms +does really provide any extra value over CRC32c, but it does add a lot of +complexity and so there is no provision for changing the integrity checking +mechanism. + +Self describing metadata needs to contain enough information so that the +metadata block can be verified as being in the correct place without needing to +look at any other metadata. This means it needs to contain location information. +Just adding a block number to the metadata is not sufficient to protect against +mis-directed writes - a write might be misdirected to the wrong LUN and so be +written to the "correct block" of the wrong filesystem. Hence location +information must contain a filesystem identifier as well as a block number. + +Another key information point in forensic analysis is knowing who the metadata +block belongs to. We already know the type, the location, that it is valid +and/or corrupted, and how long ago that it was last modified. Knowing the owner +of the block is important as it allows us to find other related metadata to +determine the scope of the corruption. For example, if we have a extent btree +object, we don't know what inode it belongs to and hence have to walk the entire +filesystem to find the owner of the block. Worse, the corruption could mean that +no owner can be found (i.e. it's an orphan block), and so without an owner field +in the metadata we have no idea of the scope of the corruption. If we have an +owner field in the metadata object, we can immediately do top down validation to +determine the scope of the problem. + +Different types of metadata have different owner identifiers. For example, +directory, attribute and extent tree blocks are all owned by an inode, while +freespace btree blocks are owned by an allocation group. Hence the size and +contents of the owner field are determined by the type of metadata object we are +looking at. The owner information can also identify misplaced writes (e.g. +freespace btree block written to the wrong AG). + +Self describing metadata also needs to contain some indication of when it was +written to the filesystem. One of the key information points when doing forensic +analysis is how recently the block was modified. Correlation of set of corrupted +metadata blocks based on modification times is important as it can indicate +whether the corruptions are related, whether there's been multiple corruption +events that lead to the eventual failure, and even whether there are corruptions +present that the run-time verification is not detecting. + +For example, we can determine whether a metadata object is supposed to be free +space or still allocated if it is still referenced by its owner by looking at +when the free space btree block that contains the block was last written +compared to when the metadata object itself was last written. If the free space +block is more recent than the object and the object's owner, then there is a +very good chance that the block should have been removed from the owner. + +To provide this "written timestamp", each metadata block gets the Log Sequence +Number (LSN) of the most recent transaction it was modified on written into it. +This number will always increase over the life of the filesystem, and the only +thing that resets it is running xfs_repair on the filesystem. Further, by use of +the LSN we can tell if the corrupted metadata all belonged to the same log +checkpoint and hence have some idea of how much modification occurred between +the first and last instance of corrupt metadata on disk and, further, how much +modification occurred between the corruption being written and when it was +detected. + +Runtime Validation +================== + +Validation of self-describing metadata takes place at runtime in two places: + + - immediately after a successful read from disk + - immediately prior to write IO submission + +The verification is completely stateless - it is done independently of the +modification process, and seeks only to check that the metadata is what it says +it is and that the metadata fields are within bounds and internally consistent. +As such, we cannot catch all types of corruption that can occur within a block +as there may be certain limitations that operational state enforces of the +metadata, or there may be corruption of interblock relationships (e.g. corrupted +sibling pointer lists). Hence we still need stateful checking in the main code +body, but in general most of the per-field validation is handled by the +verifiers. + +For read verification, the caller needs to specify the expected type of metadata +that it should see, and the IO completion process verifies that the metadata +object matches what was expected. If the verification process fails, then it +marks the object being read as EFSCORRUPTED. The caller needs to catch this +error (same as for IO errors), and if it needs to take special action due to a +verification error it can do so by catching the EFSCORRUPTED error value. If we +need more discrimination of error type at higher levels, we can define new +error numbers for different errors as necessary. + +The first step in read verification is checking the magic number and determining +whether CRC validating is necessary. If it is, the CRC32c is calculated and +compared against the value stored in the object itself. Once this is validated, +further checks are made against the location information, followed by extensive +object specific metadata validation. If any of these checks fail, then the +buffer is considered corrupt and the EFSCORRUPTED error is set appropriately. + +Write verification is the opposite of the read verification - first the object +is extensively verified and if it is OK we then update the LSN from the last +modification made to the object, After this, we calculate the CRC and insert it +into the object. Once this is done the write IO is allowed to continue. If any +error occurs during this process, the buffer is again marked with a EFSCORRUPTED +error for the higher layers to catch. + +Structures +========== + +A typical on-disk structure needs to contain the following information:: + + struct xfs_ondisk_hdr { + __be32 magic; /* magic number */ + __be32 crc; /* CRC, not logged */ + uuid_t uuid; /* filesystem identifier */ + __be64 owner; /* parent object */ + __be64 blkno; /* location on disk */ + __be64 lsn; /* last modification in log, not logged */ + }; + +Depending on the metadata, this information may be part of a header structure +separate to the metadata contents, or may be distributed through an existing +structure. The latter occurs with metadata that already contains some of this +information, such as the superblock and AG headers. + +Other metadata may have different formats for the information, but the same +level of information is generally provided. For example: + + - short btree blocks have a 32 bit owner (ag number) and a 32 bit block + number for location. The two of these combined provide the same + information as @owner and @blkno in eh above structure, but using 8 + bytes less space on disk. + + - directory/attribute node blocks have a 16 bit magic number, and the + header that contains the magic number has other information in it as + well. hence the additional metadata headers change the overall format + of the metadata. + +A typical buffer read verifier is structured as follows:: + + #define XFS_FOO_CRC_OFF offsetof(struct xfs_ondisk_hdr, crc) + + static void + xfs_foo_read_verify( + struct xfs_buf *bp) + { + struct xfs_mount *mp = bp->b_mount; + + if ((xfs_sb_version_hascrc(&mp->m_sb) && + !xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length), + XFS_FOO_CRC_OFF)) || + !xfs_foo_verify(bp)) { + XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr); + xfs_buf_ioerror(bp, EFSCORRUPTED); + } + } + +The code ensures that the CRC is only checked if the filesystem has CRCs enabled +by checking the superblock of the feature bit, and then if the CRC verifies OK +(or is not needed) it verifies the actual contents of the block. + +The verifier function will take a couple of different forms, depending on +whether the magic number can be used to determine the format of the block. In +the case it can't, the code is structured as follows:: + + static bool + xfs_foo_verify( + struct xfs_buf *bp) + { + struct xfs_mount *mp = bp->b_mount; + struct xfs_ondisk_hdr *hdr = bp->b_addr; + + if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC)) + return false; + + if (!xfs_sb_version_hascrc(&mp->m_sb)) { + if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid)) + return false; + if (bp->b_bn != be64_to_cpu(hdr->blkno)) + return false; + if (hdr->owner == 0) + return false; + } + + /* object specific verification checks here */ + + return true; + } + +If there are different magic numbers for the different formats, the verifier +will look like:: + + static bool + xfs_foo_verify( + struct xfs_buf *bp) + { + struct xfs_mount *mp = bp->b_mount; + struct xfs_ondisk_hdr *hdr = bp->b_addr; + + if (hdr->magic == cpu_to_be32(XFS_FOO_CRC_MAGIC)) { + if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid)) + return false; + if (bp->b_bn != be64_to_cpu(hdr->blkno)) + return false; + if (hdr->owner == 0) + return false; + } else if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC)) + return false; + + /* object specific verification checks here */ + + return true; + } + +Write verifiers are very similar to the read verifiers, they just do things in +the opposite order to the read verifiers. A typical write verifier:: + + static void + xfs_foo_write_verify( + struct xfs_buf *bp) + { + struct xfs_mount *mp = bp->b_mount; + struct xfs_buf_log_item *bip = bp->b_fspriv; + + if (!xfs_foo_verify(bp)) { + XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr); + xfs_buf_ioerror(bp, EFSCORRUPTED); + return; + } + + if (!xfs_sb_version_hascrc(&mp->m_sb)) + return; + + + if (bip) { + struct xfs_ondisk_hdr *hdr = bp->b_addr; + hdr->lsn = cpu_to_be64(bip->bli_item.li_lsn); + } + xfs_update_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF); + } + +This will verify the internal structure of the metadata before we go any +further, detecting corruptions that have occurred as the metadata has been +modified in memory. If the metadata verifies OK, and CRCs are enabled, we then +update the LSN field (when it was last modified) and calculate the CRC on the +metadata. Once this is done, we can issue the IO. + +Inodes and Dquots +================= + +Inodes and dquots are special snowflakes. They have per-object CRC and +self-identifiers, but they are packed so that there are multiple objects per +buffer. Hence we do not use per-buffer verifiers to do the work of per-object +verification and CRC calculations. The per-buffer verifiers simply perform basic +identification of the buffer - that they contain inodes or dquots, and that +there are magic numbers in all the expected spots. All further CRC and +verification checks are done when each inode is read from or written back to the +buffer. + +The structure of the verifiers and the identifiers checks is very similar to the +buffer code described above. The only difference is where they are called. For +example, inode read verification is done in xfs_inode_from_disk() when the inode +is first read out of the buffer and the struct xfs_inode is instantiated. The +inode is already extensively verified during writeback in xfs_iflush_int, so the +only addition here is to add the LSN and CRC to the inode as it is copied back +into the buffer. + +XXX: inode unlinked list modification doesn't recalculate the inode CRC! None of +the unlinked list modifications check or update CRCs, neither during unlink nor +log recovery. So, it's gone unnoticed until now. This won't matter immediately - +repair will probably complain about it - but it needs to be fixed. |