From 387b14684f94483cbbb72843db406ec9a8d0d6d2 Mon Sep 17 00:00:00 2001
From: Mauro Carvalho Chehab
Date: Wed, 10 Apr 2019 08:32:41 -0300
Subject: docs: locking: convert docs to ReST and rename to *.rst

Convert the locking documents to ReST and add them to the
kernel development book where it belongs.

Most of the stuff here is just to make Sphinx to properly
parse the text file, as they're already in good shape,
not requiring massive changes in order to be parsed.

The conversion is actually:
  - add blank lines and identation in order to identify paragraphs;
  - fix tables markups;
  - add some lists markups;
  - mark literal blocks;
  - adjust title markups.

At its new index.rst, let's add a :orphan: while this is not linked to
the main index.rst file, in order to avoid build warnings.

Signed-off-by: Mauro Carvalho Chehab
Acked-by: Federico Vaga
---
 Documentation/kernel-hacking/locking.rst          |   2 +-
 Documentation/locking/index.rst                   |  24 +
 Documentation/locking/lockdep-design.rst          | 394 ++++++++++++++
 Documentation/locking/lockdep-design.txt          | 389 --------------
 Documentation/locking/lockstat.rst                | 204 ++++++++
 Documentation/locking/lockstat.txt                | 183 -------
 Documentation/locking/locktorture.rst             | 170 ++++++
 Documentation/locking/locktorture.txt             | 145 ------
 Documentation/locking/mutex-design.rst            | 152 ++++++
 Documentation/locking/mutex-design.txt            | 142 -----
 Documentation/locking/rt-mutex-design.rst         | 574 +++++++++++++++++++++
 Documentation/locking/rt-mutex-design.txt         | 559 --------------------
 Documentation/locking/rt-mutex.rst                |  77 +++
 Documentation/locking/rt-mutex.txt                |  73 ---
 Documentation/locking/spinlocks.rst               | 177 +++++++
 Documentation/locking/spinlocks.txt               | 167 ------
 Documentation/locking/ww-mutex-design.rst         | 393 ++++++++++++++
 Documentation/locking/ww-mutex-design.txt         | 383 --------------
 Documentation/pi-futex.txt                        |   2 +-
 .../translations/it_IT/kernel-hacking/locking.rst |   2 +-
 20 files changed, 2168 insertions(+), 2044 deletions(-)
 create mode 100644 Documentation/locking/index.rst
 create mode 100644 Documentation/locking/lockdep-design.rst
 delete mode 100644 Documentation/locking/lockdep-design.txt
 create mode 100644 Documentation/locking/lockstat.rst
 delete mode 100644 Documentation/locking/lockstat.txt
 create mode 100644 Documentation/locking/locktorture.rst
 delete mode 100644 Documentation/locking/locktorture.txt
 create mode 100644 Documentation/locking/mutex-design.rst
 delete mode 100644 Documentation/locking/mutex-design.txt
 create mode 100644 Documentation/locking/rt-mutex-design.rst
 delete mode 100644 Documentation/locking/rt-mutex-design.txt
 create mode 100644 Documentation/locking/rt-mutex.rst
 delete mode 100644 Documentation/locking/rt-mutex.txt
 create mode 100644 Documentation/locking/spinlocks.rst
 delete mode 100644 Documentation/locking/spinlocks.txt
 create mode 100644 Documentation/locking/ww-mutex-design.rst
 delete mode 100644 Documentation/locking/ww-mutex-design.txt

(limited to 'Documentation')

diff --git a/Documentation/kernel-hacking/locking.rst b/Documentation/kernel-hacking/locking.rst
index dc698ea456e0..a8518ac0d31d 100644
--- a/Documentation/kernel-hacking/locking.rst
+++ b/Documentation/kernel-hacking/locking.rst
@@ -1364,7 +1364,7 @@ Futex API reference
 Further reading
 ===============
 
--  ``Documentation/locking/spinlocks.txt``: Linus Torvalds' spinlocking
+-  ``Documentation/locking/spinlocks.rst``: Linus Torvalds' spinlocking
    tutorial in the kernel sources.
 
 -  Unix Systems for Modern Architectures: Symmetric Multiprocessing and
diff --git a/Documentation/locking/index.rst b/Documentation/locking/index.rst
new file mode 100644
index 000000000000..ef5da7fe9aac
--- /dev/null
+++ b/Documentation/locking/index.rst
@@ -0,0 +1,24 @@
+:orphan:
+
+=======
+locking
+=======
+
+.. toctree::
+    :maxdepth: 1
+
+    lockdep-design
+    lockstat
+    locktorture
+    mutex-design
+    rt-mutex-design
+    rt-mutex
+    spinlocks
+    ww-mutex-design
+
+.. 
only:: subproject and html + + Indices + ======= + + * :ref:`genindex` diff --git a/Documentation/locking/lockdep-design.rst b/Documentation/locking/lockdep-design.rst new file mode 100644 index 000000000000..23fcbc4d3fc0 --- /dev/null +++ b/Documentation/locking/lockdep-design.rst @@ -0,0 +1,394 @@ +Runtime locking correctness validator +===================================== + +started by Ingo Molnar + +additions by Arjan van de Ven + +Lock-class +---------- + +The basic object the validator operates upon is a 'class' of locks. + +A class of locks is a group of locks that are logically the same with +respect to locking rules, even if the locks may have multiple (possibly +tens of thousands of) instantiations. For example a lock in the inode +struct is one class, while each inode has its own instantiation of that +lock class. + +The validator tracks the 'usage state' of lock-classes, and it tracks +the dependencies between different lock-classes. Lock usage indicates +how a lock is used with regard to its IRQ contexts, while lock +dependency can be understood as lock order, where L1 -> L2 suggests that +a task is attempting to acquire L2 while holding L1. From lockdep's +perspective, the two locks (L1 and L2) are not necessarily related; that +dependency just means the order ever happened. The validator maintains a +continuing effort to prove lock usages and dependencies are correct or +the validator will shoot a splat if incorrect. + +A lock-class's behavior is constructed by its instances collectively: +when the first instance of a lock-class is used after bootup the class +gets registered, then all (subsequent) instances will be mapped to the +class and hence their usages and dependecies will contribute to those of +the class. A lock-class does not go away when a lock instance does, but +it can be removed if the memory space of the lock class (static or +dynamic) is reclaimed, this happens for example when a module is +unloaded or a workqueue is destroyed. 
+ +State +----- + +The validator tracks lock-class usage history and divides the usage into +(4 usages * n STATEs + 1) categories: + +where the 4 usages can be: +- 'ever held in STATE context' +- 'ever held as readlock in STATE context' +- 'ever held with STATE enabled' +- 'ever held as readlock with STATE enabled' + +where the n STATEs are coded in kernel/locking/lockdep_states.h and as of +now they include: +- hardirq +- softirq + +where the last 1 category is: +- 'ever used' [ == !unused ] + +When locking rules are violated, these usage bits are presented in the +locking error messages, inside curlies, with a total of 2 * n STATEs bits. +A contrived example:: + + modprobe/2287 is trying to acquire lock: + (&sio_locks[i].lock){-.-.}, at: [] mutex_lock+0x21/0x24 + + but task is already holding lock: + (&sio_locks[i].lock){-.-.}, at: [] mutex_lock+0x21/0x24 + + +For a given lock, the bit positions from left to right indicate the usage +of the lock and readlock (if exists), for each of the n STATEs listed +above respectively, and the character displayed at each bit position +indicates: + + === =================================================== + '.' acquired while irqs disabled and not in irq context + '-' acquired in irq context + '+' acquired with irqs enabled + '?' acquired in irq context with irqs enabled. + === =================================================== + +The bits are illustrated with an example:: + + (&sio_locks[i].lock){-.-.}, at: [] mutex_lock+0x21/0x24 + |||| + ||| \-> softirq disabled and not in softirq context + || \--> acquired in softirq context + | \---> hardirq disabled and not in hardirq context + \----> acquired in hardirq context + + +For a given STATE, whether the lock is ever acquired in that STATE +context and whether that STATE is enabled yields four possible cases as +shown in the table below. The bit character is able to indicate which +exact case is for the lock as of the reporting time. 
+
+   +--------------+-------------+--------------+
+   |              | irq enabled | irq disabled |
+   +--------------+-------------+--------------+
+   | ever in irq  |      ?      |      -       |
+   +--------------+-------------+--------------+
+   | never in irq |      +      |      .       |
+   +--------------+-------------+--------------+
+
+The character '-' suggests irq is disabled because if otherwise the
+charactor '?' would have been shown instead. Similar deduction can be
+applied for '+' too.
+
+Unused locks (e.g., mutexes) cannot be part of the cause of an error.
+
+
+Single-lock state rules:
+------------------------
+
+A lock is irq-safe means it was ever used in an irq context, while a lock
+is irq-unsafe means it was ever acquired with irq enabled.
+
+A softirq-unsafe lock-class is automatically hardirq-unsafe as well. The
+following states must be exclusive: only one of them is allowed to be set
+for any lock-class based on its usage::
+
+ <hardirq-safe> or <hardirq-unsafe>
+ <softirq-safe> or <softirq-unsafe>
+
+This is because if a lock can be used in irq context (irq-safe) then it
+cannot be ever acquired with irq enabled (irq-unsafe). Otherwise, a
+deadlock may happen. For example, in the scenario that after this lock
+was acquired but before released, if the context is interrupted this
+lock will be attempted to acquire twice, which creates a deadlock,
+referred to as lock recursion deadlock.
+
+The validator detects and reports lock usage that violates these
+single-lock state rules.
+
+Multi-lock dependency rules:
+----------------------------
+
+The same lock-class must not be acquired twice, because this could lead
+to lock recursion deadlocks.
+
+Furthermore, two locks can not be taken in inverse order::
+
+ <L1> -> <L2>
+ <L2> -> <L1>
+
+because this could lead to a deadlock - referred to as lock inversion
+deadlock - as attempts to acquire the two locks form a circle which
+could lead to the two contexts waiting for each other permanently. The
+validator will find such dependency circle in arbitrary complexity,
+i.e., there can be any other locking sequence between the acquire-lock
+operations; the validator will still find whether these locks can be
+acquired in a circular fashion.
+
+Furthermore, the following usage based lock dependencies are not allowed
+between any two lock-classes::
+
+   <hardirq-safe>   ->  <hardirq-unsafe>
+   <softirq-safe>   ->  <softirq-unsafe>
+
+The first rule comes from the fact that a hardirq-safe lock could be
+taken by a hardirq context, interrupting a hardirq-unsafe lock - and
+thus could result in a lock inversion deadlock. Likewise, a softirq-safe
+lock could be taken by an softirq context, interrupting a softirq-unsafe
+lock.
+
+The above rules are enforced for any locking sequence that occurs in the
+kernel: when acquiring a new lock, the validator checks whether there is
+any rule violation between the new lock and any of the held locks.
+
+When a lock-class changes its state, the following aspects of the above
+dependency rules are enforced:
+
+- if a new hardirq-safe lock is discovered, we check whether it
+  took any hardirq-unsafe lock in the past.
+
+- if a new softirq-safe lock is discovered, we check whether it took
+  any softirq-unsafe lock in the past.
+
+- if a new hardirq-unsafe lock is discovered, we check whether any
+  hardirq-safe lock took it in the past.
+
+- if a new softirq-unsafe lock is discovered, we check whether any
+  softirq-safe lock took it in the past.
+
+(Again, we do these checks too on the basis that an interrupt context
+could interrupt _any_ of the irq-unsafe or hardirq-unsafe locks, which
+could lead to a lock inversion deadlock - even if that lock scenario did
+not trigger in practice yet.)
+
+Exception: Nested data dependencies leading to nested locking
+-------------------------------------------------------------
+
+There are a few cases where the Linux kernel acquires more than one
+instance of the same lock-class.
Such cases typically happen when there +is some sort of hierarchy within objects of the same type. In these +cases there is an inherent "natural" ordering between the two objects +(defined by the properties of the hierarchy), and the kernel grabs the +locks in this fixed order on each of the objects. + +An example of such an object hierarchy that results in "nested locking" +is that of a "whole disk" block-dev object and a "partition" block-dev +object; the partition is "part of" the whole device and as long as one +always takes the whole disk lock as a higher lock than the partition +lock, the lock ordering is fully correct. The validator does not +automatically detect this natural ordering, as the locking rule behind +the ordering is not static. + +In order to teach the validator about this correct usage model, new +versions of the various locking primitives were added that allow you to +specify a "nesting level". An example call, for the block device mutex, +looks like this:: + + enum bdev_bd_mutex_lock_class + { + BD_MUTEX_NORMAL, + BD_MUTEX_WHOLE, + BD_MUTEX_PARTITION + }; + +mutex_lock_nested(&bdev->bd_contains->bd_mutex, BD_MUTEX_PARTITION); + +In this case the locking is done on a bdev object that is known to be a +partition. + +The validator treats a lock that is taken in such a nested fashion as a +separate (sub)class for the purposes of validation. + +Note: When changing code to use the _nested() primitives, be careful and +check really thoroughly that the hierarchy is correctly mapped; otherwise +you can get false positives or false negatives. + +Annotations +----------- + +Two constructs can be used to annotate and check where and if certain locks +must be held: lockdep_assert_held*(&lock) and lockdep_*pin_lock(&lock). + +As the name suggests, lockdep_assert_held* family of macros assert that a +particular lock is held at a certain time (and generate a WARN() otherwise). +This annotation is largely used all over the kernel, e.g. 
kernel/sched/ +core.c:: + + void update_rq_clock(struct rq *rq) + { + s64 delta; + + lockdep_assert_held(&rq->lock); + [...] + } + +where holding rq->lock is required to safely update a rq's clock. + +The other family of macros is lockdep_*pin_lock(), which is admittedly only +used for rq->lock ATM. Despite their limited adoption these annotations +generate a WARN() if the lock of interest is "accidentally" unlocked. This turns +out to be especially helpful to debug code with callbacks, where an upper +layer assumes a lock remains taken, but a lower layer thinks it can maybe drop +and reacquire the lock ("unwittingly" introducing races). lockdep_pin_lock() +returns a 'struct pin_cookie' that is then used by lockdep_unpin_lock() to check +that nobody tampered with the lock, e.g. kernel/sched/sched.h:: + + static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf) + { + rf->cookie = lockdep_pin_lock(&rq->lock); + [...] + } + + static inline void rq_unpin_lock(struct rq *rq, struct rq_flags *rf) + { + [...] + lockdep_unpin_lock(&rq->lock, rf->cookie); + } + +While comments about locking requirements might provide useful information, +the runtime checks performed by annotations are invaluable when debugging +locking problems and they carry the same level of details when inspecting +code. Always prefer annotations when in doubt! + +Proof of 100% correctness: +-------------------------- + +The validator achieves perfect, mathematical 'closure' (proof of locking +correctness) in the sense that for every simple, standalone single-task +locking sequence that occurred at least once during the lifetime of the +kernel, the validator proves it with a 100% certainty that no +combination and timing of these locking sequences can cause any class of +lock related deadlock. [1]_ + +I.e. 
complex multi-CPU and multi-task locking scenarios do not have to +occur in practice to prove a deadlock: only the simple 'component' +locking chains have to occur at least once (anytime, in any +task/context) for the validator to be able to prove correctness. (For +example, complex deadlocks that would normally need more than 3 CPUs and +a very unlikely constellation of tasks, irq-contexts and timings to +occur, can be detected on a plain, lightly loaded single-CPU system as +well!) + +This radically decreases the complexity of locking related QA of the +kernel: what has to be done during QA is to trigger as many "simple" +single-task locking dependencies in the kernel as possible, at least +once, to prove locking correctness - instead of having to trigger every +possible combination of locking interaction between CPUs, combined with +every possible hardirq and softirq nesting scenario (which is impossible +to do in practice). + +.. [1] + + assuming that the validator itself is 100% correct, and no other + part of the system corrupts the state of the validator in any way. + We also assume that all NMI/SMM paths [which could interrupt + even hardirq-disabled codepaths] are correct and do not interfere + with the validator. We also assume that the 64-bit 'chain hash' + value is unique for every lock-chain in the system. Also, lock + recursion must not be higher than 20. + +Performance: +------------ + +The above rules require **massive** amounts of runtime checking. If we did +that for every lock taken and for every irqs-enable event, it would +render the system practically unusably slow. The complexity of checking +is O(N^2), so even with just a few hundred lock-classes we'd have to do +tens of thousands of checks for every event. + +This problem is solved by checking any given 'locking scenario' (unique +sequence of locks taken after each other) only once. 
A simple stack of +held locks is maintained, and a lightweight 64-bit hash value is +calculated, which hash is unique for every lock chain. The hash value, +when the chain is validated for the first time, is then put into a hash +table, which hash-table can be checked in a lockfree manner. If the +locking chain occurs again later on, the hash table tells us that we +don't have to validate the chain again. + +Troubleshooting: +---------------- + +The validator tracks a maximum of MAX_LOCKDEP_KEYS number of lock classes. +Exceeding this number will trigger the following lockdep warning: + + (DEBUG_LOCKS_WARN_ON(id >= MAX_LOCKDEP_KEYS)) + +By default, MAX_LOCKDEP_KEYS is currently set to 8191, and typical +desktop systems have less than 1,000 lock classes, so this warning +normally results from lock-class leakage or failure to properly +initialize locks. These two problems are illustrated below: + +1. Repeated module loading and unloading while running the validator + will result in lock-class leakage. The issue here is that each + load of the module will create a new set of lock classes for + that module's locks, but module unloading does not remove old + classes (see below discussion of reuse of lock classes for why). + Therefore, if that module is loaded and unloaded repeatedly, + the number of lock classes will eventually reach the maximum. + +2. Using structures such as arrays that have large numbers of + locks that are not explicitly initialized. For example, + a hash table with 8192 buckets where each bucket has its own + spinlock_t will consume 8192 lock classes -unless- each spinlock + is explicitly initialized at runtime, for example, using the + run-time spin_lock_init() as opposed to compile-time initializers + such as __SPIN_LOCK_UNLOCKED(). Failure to properly initialize + the per-bucket spinlocks would guarantee lock-class overflow. + In contrast, a loop that called spin_lock_init() on each lock + would place all 8192 locks into a single lock class. 
+ + The moral of this story is that you should always explicitly + initialize your locks. + +One might argue that the validator should be modified to allow +lock classes to be reused. However, if you are tempted to make this +argument, first review the code and think through the changes that would +be required, keeping in mind that the lock classes to be removed are +likely to be linked into the lock-dependency graph. This turns out to +be harder to do than to say. + +Of course, if you do run out of lock classes, the next thing to do is +to find the offending lock classes. First, the following command gives +you the number of lock classes currently in use along with the maximum:: + + grep "lock-classes" /proc/lockdep_stats + +This command produces the following output on a modest system:: + + lock-classes: 748 [max: 8191] + +If the number allocated (748 above) increases continually over time, +then there is likely a leak. The following command can be used to +identify the leaking lock classes:: + + grep "BD" /proc/lockdep + +Run the command and save the output, then compare against the output from +a later run of this command to identify the leakers. This same output +can also help you find situations where runtime lock initialization has +been omitted. diff --git a/Documentation/locking/lockdep-design.txt b/Documentation/locking/lockdep-design.txt deleted file mode 100644 index f189d130e543..000000000000 --- a/Documentation/locking/lockdep-design.txt +++ /dev/null @@ -1,389 +0,0 @@ -Runtime locking correctness validator -===================================== - -started by Ingo Molnar -additions by Arjan van de Ven - -Lock-class ----------- - -The basic object the validator operates upon is a 'class' of locks. - -A class of locks is a group of locks that are logically the same with -respect to locking rules, even if the locks may have multiple (possibly -tens of thousands of) instantiations. 
For example a lock in the inode -struct is one class, while each inode has its own instantiation of that -lock class. - -The validator tracks the 'usage state' of lock-classes, and it tracks -the dependencies between different lock-classes. Lock usage indicates -how a lock is used with regard to its IRQ contexts, while lock -dependency can be understood as lock order, where L1 -> L2 suggests that -a task is attempting to acquire L2 while holding L1. From lockdep's -perspective, the two locks (L1 and L2) are not necessarily related; that -dependency just means the order ever happened. The validator maintains a -continuing effort to prove lock usages and dependencies are correct or -the validator will shoot a splat if incorrect. - -A lock-class's behavior is constructed by its instances collectively: -when the first instance of a lock-class is used after bootup the class -gets registered, then all (subsequent) instances will be mapped to the -class and hence their usages and dependecies will contribute to those of -the class. A lock-class does not go away when a lock instance does, but -it can be removed if the memory space of the lock class (static or -dynamic) is reclaimed, this happens for example when a module is -unloaded or a workqueue is destroyed. - -State ------ - -The validator tracks lock-class usage history and divides the usage into -(4 usages * n STATEs + 1) categories: - -where the 4 usages can be: -- 'ever held in STATE context' -- 'ever held as readlock in STATE context' -- 'ever held with STATE enabled' -- 'ever held as readlock with STATE enabled' - -where the n STATEs are coded in kernel/locking/lockdep_states.h and as of -now they include: -- hardirq -- softirq - -where the last 1 category is: -- 'ever used' [ == !unused ] - -When locking rules are violated, these usage bits are presented in the -locking error messages, inside curlies, with a total of 2 * n STATEs bits. 
-A contrived example: - - modprobe/2287 is trying to acquire lock: - (&sio_locks[i].lock){-.-.}, at: [] mutex_lock+0x21/0x24 - - but task is already holding lock: - (&sio_locks[i].lock){-.-.}, at: [] mutex_lock+0x21/0x24 - - -For a given lock, the bit positions from left to right indicate the usage -of the lock and readlock (if exists), for each of the n STATEs listed -above respectively, and the character displayed at each bit position -indicates: - - '.' acquired while irqs disabled and not in irq context - '-' acquired in irq context - '+' acquired with irqs enabled - '?' acquired in irq context with irqs enabled. - -The bits are illustrated with an example: - - (&sio_locks[i].lock){-.-.}, at: [] mutex_lock+0x21/0x24 - |||| - ||| \-> softirq disabled and not in softirq context - || \--> acquired in softirq context - | \---> hardirq disabled and not in hardirq context - \----> acquired in hardirq context - - -For a given STATE, whether the lock is ever acquired in that STATE -context and whether that STATE is enabled yields four possible cases as -shown in the table below. The bit character is able to indicate which -exact case is for the lock as of the reporting time. - - ------------------------------------------- - | | irq enabled | irq disabled | - |-------------------------------------------| - | ever in irq | ? | - | - |-------------------------------------------| - | never in irq | + | . | - ------------------------------------------- - -The character '-' suggests irq is disabled because if otherwise the -charactor '?' would have been shown instead. Similar deduction can be -applied for '+' too. - -Unused locks (e.g., mutexes) cannot be part of the cause of an error. - - -Single-lock state rules: ------------------------- - -A lock is irq-safe means it was ever used in an irq context, while a lock -is irq-unsafe means it was ever acquired with irq enabled. - -A softirq-unsafe lock-class is automatically hardirq-unsafe as well. 
The
-following states must be exclusive: only one of them is allowed to be set
-for any lock-class based on its usage:
-
- <hardirq-safe> or <hardirq-unsafe>
- <softirq-safe> or <softirq-unsafe>
-
-This is because if a lock can be used in irq context (irq-safe) then it
-cannot be ever acquired with irq enabled (irq-unsafe). Otherwise, a
-deadlock may happen. For example, in the scenario that after this lock
-was acquired but before released, if the context is interrupted this
-lock will be attempted to acquire twice, which creates a deadlock,
-referred to as lock recursion deadlock.
-
-The validator detects and reports lock usage that violates these
-single-lock state rules.
-
-Multi-lock dependency rules:
-----------------------------
-
-The same lock-class must not be acquired twice, because this could lead
-to lock recursion deadlocks.
-
-Furthermore, two locks can not be taken in inverse order:
-
- <L1> -> <L2>
- <L2> -> <L1>
-
-because this could lead to a deadlock - referred to as lock inversion
-deadlock - as attempts to acquire the two locks form a circle which
-could lead to the two contexts waiting for each other permanently. The
-validator will find such dependency circle in arbitrary complexity,
-i.e., there can be any other locking sequence between the acquire-lock
-operations; the validator will still find whether these locks can be
-acquired in a circular fashion.
-
-Furthermore, the following usage based lock dependencies are not allowed
-between any two lock-classes:
-
-   <hardirq-safe>   ->  <hardirq-unsafe>
-   <softirq-safe>   ->  <softirq-unsafe>
-
-The first rule comes from the fact that a hardirq-safe lock could be
-taken by a hardirq context, interrupting a hardirq-unsafe lock - and
-thus could result in a lock inversion deadlock. Likewise, a softirq-safe
-lock could be taken by an softirq context, interrupting a softirq-unsafe
-lock.
-
-The above rules are enforced for any locking sequence that occurs in the
-kernel: when acquiring a new lock, the validator checks whether there is
-any rule violation between the new lock and any of the held locks.
- -When a lock-class changes its state, the following aspects of the above -dependency rules are enforced: - -- if a new hardirq-safe lock is discovered, we check whether it - took any hardirq-unsafe lock in the past. - -- if a new softirq-safe lock is discovered, we check whether it took - any softirq-unsafe lock in the past. - -- if a new hardirq-unsafe lock is discovered, we check whether any - hardirq-safe lock took it in the past. - -- if a new softirq-unsafe lock is discovered, we check whether any - softirq-safe lock took it in the past. - -(Again, we do these checks too on the basis that an interrupt context -could interrupt _any_ of the irq-unsafe or hardirq-unsafe locks, which -could lead to a lock inversion deadlock - even if that lock scenario did -not trigger in practice yet.) - -Exception: Nested data dependencies leading to nested locking -------------------------------------------------------------- - -There are a few cases where the Linux kernel acquires more than one -instance of the same lock-class. Such cases typically happen when there -is some sort of hierarchy within objects of the same type. In these -cases there is an inherent "natural" ordering between the two objects -(defined by the properties of the hierarchy), and the kernel grabs the -locks in this fixed order on each of the objects. - -An example of such an object hierarchy that results in "nested locking" -is that of a "whole disk" block-dev object and a "partition" block-dev -object; the partition is "part of" the whole device and as long as one -always takes the whole disk lock as a higher lock than the partition -lock, the lock ordering is fully correct. The validator does not -automatically detect this natural ordering, as the locking rule behind -the ordering is not static. - -In order to teach the validator about this correct usage model, new -versions of the various locking primitives were added that allow you to -specify a "nesting level". 
An example call, for the block device mutex, -looks like this: - -enum bdev_bd_mutex_lock_class -{ - BD_MUTEX_NORMAL, - BD_MUTEX_WHOLE, - BD_MUTEX_PARTITION -}; - - mutex_lock_nested(&bdev->bd_contains->bd_mutex, BD_MUTEX_PARTITION); - -In this case the locking is done on a bdev object that is known to be a -partition. - -The validator treats a lock that is taken in such a nested fashion as a -separate (sub)class for the purposes of validation. - -Note: When changing code to use the _nested() primitives, be careful and -check really thoroughly that the hierarchy is correctly mapped; otherwise -you can get false positives or false negatives. - -Annotations ------------ - -Two constructs can be used to annotate and check where and if certain locks -must be held: lockdep_assert_held*(&lock) and lockdep_*pin_lock(&lock). - -As the name suggests, lockdep_assert_held* family of macros assert that a -particular lock is held at a certain time (and generate a WARN() otherwise). -This annotation is largely used all over the kernel, e.g. kernel/sched/ -core.c - - void update_rq_clock(struct rq *rq) - { - s64 delta; - - lockdep_assert_held(&rq->lock); - [...] - } - -where holding rq->lock is required to safely update a rq's clock. - -The other family of macros is lockdep_*pin_lock(), which is admittedly only -used for rq->lock ATM. Despite their limited adoption these annotations -generate a WARN() if the lock of interest is "accidentally" unlocked. This turns -out to be especially helpful to debug code with callbacks, where an upper -layer assumes a lock remains taken, but a lower layer thinks it can maybe drop -and reacquire the lock ("unwittingly" introducing races). lockdep_pin_lock() -returns a 'struct pin_cookie' that is then used by lockdep_unpin_lock() to check -that nobody tampered with the lock, e.g. kernel/sched/sched.h - - static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf) - { - rf->cookie = lockdep_pin_lock(&rq->lock); - [...] 
- } - - static inline void rq_unpin_lock(struct rq *rq, struct rq_flags *rf) - { - [...] - lockdep_unpin_lock(&rq->lock, rf->cookie); - } - -While comments about locking requirements might provide useful information, -the runtime checks performed by annotations are invaluable when debugging -locking problems and they carry the same level of details when inspecting -code. Always prefer annotations when in doubt! - -Proof of 100% correctness: --------------------------- - -The validator achieves perfect, mathematical 'closure' (proof of locking -correctness) in the sense that for every simple, standalone single-task -locking sequence that occurred at least once during the lifetime of the -kernel, the validator proves it with a 100% certainty that no -combination and timing of these locking sequences can cause any class of -lock related deadlock. [*] - -I.e. complex multi-CPU and multi-task locking scenarios do not have to -occur in practice to prove a deadlock: only the simple 'component' -locking chains have to occur at least once (anytime, in any -task/context) for the validator to be able to prove correctness. (For -example, complex deadlocks that would normally need more than 3 CPUs and -a very unlikely constellation of tasks, irq-contexts and timings to -occur, can be detected on a plain, lightly loaded single-CPU system as -well!) - -This radically decreases the complexity of locking related QA of the -kernel: what has to be done during QA is to trigger as many "simple" -single-task locking dependencies in the kernel as possible, at least -once, to prove locking correctness - instead of having to trigger every -possible combination of locking interaction between CPUs, combined with -every possible hardirq and softirq nesting scenario (which is impossible -to do in practice). - -[*] assuming that the validator itself is 100% correct, and no other - part of the system corrupts the state of the validator in any way. 
- We also assume that all NMI/SMM paths [which could interrupt - even hardirq-disabled codepaths] are correct and do not interfere - with the validator. We also assume that the 64-bit 'chain hash' - value is unique for every lock-chain in the system. Also, lock - recursion must not be higher than 20. - -Performance: ------------- - -The above rules require _massive_ amounts of runtime checking. If we did -that for every lock taken and for every irqs-enable event, it would -render the system practically unusably slow. The complexity of checking -is O(N^2), so even with just a few hundred lock-classes we'd have to do -tens of thousands of checks for every event. - -This problem is solved by checking any given 'locking scenario' (unique -sequence of locks taken after each other) only once. A simple stack of -held locks is maintained, and a lightweight 64-bit hash value is -calculated, which hash is unique for every lock chain. The hash value, -when the chain is validated for the first time, is then put into a hash -table, which hash-table can be checked in a lockfree manner. If the -locking chain occurs again later on, the hash table tells us that we -don't have to validate the chain again. - -Troubleshooting: ----------------- - -The validator tracks a maximum of MAX_LOCKDEP_KEYS number of lock classes. -Exceeding this number will trigger the following lockdep warning: - - (DEBUG_LOCKS_WARN_ON(id >= MAX_LOCKDEP_KEYS)) - -By default, MAX_LOCKDEP_KEYS is currently set to 8191, and typical -desktop systems have less than 1,000 lock classes, so this warning -normally results from lock-class leakage or failure to properly -initialize locks. These two problems are illustrated below: - -1. Repeated module loading and unloading while running the validator - will result in lock-class leakage. 
The issue here is that each - load of the module will create a new set of lock classes for - that module's locks, but module unloading does not remove old - classes (see below discussion of reuse of lock classes for why). - Therefore, if that module is loaded and unloaded repeatedly, - the number of lock classes will eventually reach the maximum. - -2. Using structures such as arrays that have large numbers of - locks that are not explicitly initialized. For example, - a hash table with 8192 buckets where each bucket has its own - spinlock_t will consume 8192 lock classes -unless- each spinlock - is explicitly initialized at runtime, for example, using the - run-time spin_lock_init() as opposed to compile-time initializers - such as __SPIN_LOCK_UNLOCKED(). Failure to properly initialize - the per-bucket spinlocks would guarantee lock-class overflow. - In contrast, a loop that called spin_lock_init() on each lock - would place all 8192 locks into a single lock class. - - The moral of this story is that you should always explicitly - initialize your locks. - -One might argue that the validator should be modified to allow -lock classes to be reused. However, if you are tempted to make this -argument, first review the code and think through the changes that would -be required, keeping in mind that the lock classes to be removed are -likely to be linked into the lock-dependency graph. This turns out to -be harder to do than to say. - -Of course, if you do run out of lock classes, the next thing to do is -to find the offending lock classes. First, the following command gives -you the number of lock classes currently in use along with the maximum: - - grep "lock-classes" /proc/lockdep_stats - -This command produces the following output on a modest system: - - lock-classes: 748 [max: 8191] - -If the number allocated (748 above) increases continually over time, -then there is likely a leak. 
The following command can be used to -identify the leaking lock classes: - - grep "BD" /proc/lockdep - -Run the command and save the output, then compare against the output from -a later run of this command to identify the leakers. This same output -can also help you find situations where runtime lock initialization has -been omitted. diff --git a/Documentation/locking/lockstat.rst b/Documentation/locking/lockstat.rst new file mode 100644 index 000000000000..536eab8dbd99 --- /dev/null +++ b/Documentation/locking/lockstat.rst @@ -0,0 +1,204 @@ +=============== +Lock Statistics +=============== + +What +==== + +As the name suggests, it provides statistics on locks. + + +Why +=== + +Because things like lock contention can severely impact performance. + +How +=== + +Lockdep already has hooks in the lock functions and maps lock instances to +lock classes. We build on that (see Documentation/locking/lockdep-design.rst). +The graph below shows the relation between the lock functions and the various +hooks therein:: + + __acquire + | + lock _____ + | \ + | __contended + | | + | + | _______/ + |/ + | + __acquired + | + . + + . 
+ | + __release + | + unlock + + lock, unlock - the regular lock functions + __* - the hooks + <> - states + +With these hooks we provide the following statistics: + + con-bounces + - number of lock contention that involved x-cpu data + contentions + - number of lock acquisitions that had to wait + wait time + min + - shortest (non-0) time we ever had to wait for a lock + max + - longest time we ever had to wait for a lock + total + - total time we spend waiting on this lock + avg + - average time spent waiting on this lock + acq-bounces + - number of lock acquisitions that involved x-cpu data + acquisitions + - number of times we took the lock + hold time + min + - shortest (non-0) time we ever held the lock + max + - longest time we ever held the lock + total + - total time this lock was held + avg + - average time this lock was held + +These numbers are gathered per lock class, per read/write state (when +applicable). + +It also tracks 4 contention points per class. A contention point is a call site +that had to wait on lock acquisition. + +Configuration +------------- + +Lock statistics are enabled via CONFIG_LOCK_STAT. 
+ +Usage +----- + +Enable collection of statistics:: + + # echo 1 >/proc/sys/kernel/lock_stat + +Disable collection of statistics:: + + # echo 0 >/proc/sys/kernel/lock_stat + +Look at the current lock statistics:: + + ( line numbers not part of actual output, done for clarity in the explanation + below ) + + # less /proc/lock_stat + + 01 lock_stat version 0.4 + 02----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- + 03 class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg + 04----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- + 05 + 06 &mm->mmap_sem-W: 46 84 0.26 939.10 16371.53 194.90 47291 2922365 0.16 2220301.69 17464026916.32 5975.99 + 07 &mm->mmap_sem-R: 37 100 1.31 299502.61 325629.52 3256.30 212344 34316685 0.10 7744.91 95016910.20 2.77 + 08 --------------- + 09 &mm->mmap_sem 1 [] khugepaged_scan_mm_slot+0x57/0x280 + 10 &mm->mmap_sem 96 [] __do_page_fault+0x1d4/0x510 + 11 &mm->mmap_sem 34 [] vm_mmap_pgoff+0x87/0xd0 + 12 &mm->mmap_sem 17 [] vm_munmap+0x41/0x80 + 13 --------------- + 14 &mm->mmap_sem 1 [] dup_mmap+0x2a/0x3f0 + 15 &mm->mmap_sem 60 [] SyS_mprotect+0xe9/0x250 + 16 &mm->mmap_sem 41 [] __do_page_fault+0x1d4/0x510 + 17 &mm->mmap_sem 68 [] vm_mmap_pgoff+0x87/0xd0 + 18 + 19............................................................................................................................................................................................................................. 
+ 20 + 21 unix_table_lock: 110 112 0.21 49.24 163.91 1.46 21094 66312 0.12 624.42 31589.81 0.48 + 22 --------------- + 23 unix_table_lock 45 [] unix_create1+0x16e/0x1b0 + 24 unix_table_lock 47 [] unix_release_sock+0x31/0x250 + 25 unix_table_lock 15 [] unix_find_other+0x117/0x230 + 26 unix_table_lock 5 [] unix_autobind+0x11f/0x1b0 + 27 --------------- + 28 unix_table_lock 39 [] unix_release_sock+0x31/0x250 + 29 unix_table_lock 49 [] unix_create1+0x16e/0x1b0 + 30 unix_table_lock 20 [] unix_find_other+0x117/0x230 + 31 unix_table_lock 4 [] unix_autobind+0x11f/0x1b0 + + +This excerpt shows the first two lock class statistics. Line 01 shows the +output version - each time the format changes this will be updated. Line 02-04 +show the header with column descriptions. Lines 05-18 and 20-31 show the actual +statistics. These statistics come in two parts; the actual stats separated by a +short separator (line 08, 13) from the contention points. + +Lines 09-12 show the first 4 recorded contention points (the code +which tries to get the lock) and lines 14-17 show the first 4 recorded +contended points (the lock holder). It is possible that the max +con-bounces point is missing in the statistics. + +The first lock (05-18) is a read/write lock, and shows two lines above the +short separator. The contention points don't match the column descriptors, +they have two: contentions and [] symbol. The second set of contention +points are the points we're contending with. + +The integer part of the time values is in us. + +Dealing with nested locks, subclasses may appear:: + + 32........................................................................................................................................................................................................................... 
+ 33 + 34 &rq->lock: 13128 13128 0.43 190.53 103881.26 7.91 97454 3453404 0.00 401.11 13224683.11 3.82 + 35 --------- + 36 &rq->lock 645 [] task_rq_lock+0x43/0x75 + 37 &rq->lock 297 [] try_to_wake_up+0x127/0x25a + 38 &rq->lock 360 [] select_task_rq_fair+0x1f0/0x74a + 39 &rq->lock 428 [] scheduler_tick+0x46/0x1fb + 40 --------- + 41 &rq->lock 77 [] task_rq_lock+0x43/0x75 + 42 &rq->lock 174 [] try_to_wake_up+0x127/0x25a + 43 &rq->lock 4715 [] double_rq_lock+0x42/0x54 + 44 &rq->lock 893 [] schedule+0x157/0x7b8 + 45 + 46........................................................................................................................................................................................................................... + 47 + 48 &rq->lock/1: 1526 11488 0.33 388.73 136294.31 11.86 21461 38404 0.00 37.93 109388.53 2.84 + 49 ----------- + 50 &rq->lock/1 11526 [] double_rq_lock+0x4f/0x54 + 51 ----------- + 52 &rq->lock/1 5645 [] double_rq_lock+0x42/0x54 + 53 &rq->lock/1 1224 [] schedule+0x157/0x7b8 + 54 &rq->lock/1 4336 [] double_rq_lock+0x4f/0x54 + 55 &rq->lock/1 181 [] try_to_wake_up+0x127/0x25a + +Line 48 shows statistics for the second subclass (/1) of &rq->lock class +(subclass starts from 0), since in this case, as line 50 suggests, +double_rq_lock actually acquires a nested lock of two spinlocks. 
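As a rough cross-check of the column layout described above, the following C sketch parses one per-class statistics row and recomputes the two averages. It assumes (judging from the sample rows shown here, not from the kernel source) that waittime-avg is waittime-total divided by contentions and that holdtime-avg is holdtime-total divided by acquisitions. The struct and function names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical container for one per-class row of /proc/lock_stat.
 * Time values are in microseconds, as stated in the text above. */
struct lock_stat_row {
	char name[64];
	unsigned long con_bounces, contentions;
	double wait_min, wait_max, wait_total, wait_avg;
	unsigned long acq_bounces, acquisitions;
	double hold_min, hold_max, hold_total, hold_avg;
};

/* Parse "class-name: <12 numeric columns>".
 * Returns 0 on success and -1 for any other kind of line
 * (headers, separators, contention points). */
static int lock_stat_parse_row(const char *line, struct lock_stat_row *r)
{
	int n;

	assert(line && r);
	n = sscanf(line,
		   " %63[^:]: %lu %lu %lf %lf %lf %lf %lu %lu %lf %lf %lf %lf",
		   r->name, &r->con_bounces, &r->contentions,
		   &r->wait_min, &r->wait_max, &r->wait_total, &r->wait_avg,
		   &r->acq_bounces, &r->acquisitions,
		   &r->hold_min, &r->hold_max, &r->hold_total, &r->hold_avg);
	return n == 13 ? 0 : -1;
}

/* Assumed relation: average wait = total wait / number of contentions. */
static double lock_stat_wait_avg(const struct lock_stat_row *r)
{
	return r->contentions ? r->wait_total / (double)r->contentions : 0.0;
}

/* Assumed relation: average hold = total hold / number of acquisitions. */
static double lock_stat_hold_avg(const struct lock_stat_row *r)
{
	return r->acquisitions ? r->hold_total / (double)r->acquisitions : 0.0;
}
```

Feeding it the &mm->mmap_sem-W row from the excerpt should give recomputed averages that agree with the printed waittime-avg and holdtime-avg columns to within rounding; contention-point lines carry no colon-terminated class name and are rejected.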
+ +View the top contending locks:: + + # grep : /proc/lock_stat | head + clockevents_lock: 2926159 2947636 0.15 46882.81 1784540466.34 605.41 3381345 3879161 0.00 2260.97 53178395.68 13.71 + tick_broadcast_lock: 346460 346717 0.18 2257.43 39364622.71 113.54 3642919 4242696 0.00 2263.79 49173646.60 11.59 + &mapping->i_mmap_mutex: 203896 203899 3.36 645530.05 31767507988.39 155800.21 3361776 8893984 0.17 2254.15 14110121.02 1.59 + &rq->lock: 135014 136909 0.18 606.09 842160.68 6.15 1540728 10436146 0.00 728.72 17606683.41 1.69 + &(&zone->lru_lock)->rlock: 93000 94934 0.16 59.18 188253.78 1.98 1199912 3809894 0.15 391.40 3559518.81 0.93 + tasklist_lock-W: 40667 41130 0.23 1189.42 428980.51 10.43 270278 510106 0.16 653.51 3939674.91 7.72 + tasklist_lock-R: 21298 21305 0.20 1310.05 215511.12 10.12 186204 241258 0.14 1162.33 1179779.23 4.89 + rcu_node_1: 47656 49022 0.16 635.41 193616.41 3.95 844888 1865423 0.00 764.26 1656226.96 0.89 + &(&dentry->d_lockref.lock)->rlock: 39791 40179 0.15 1302.08 88851.96 2.21 2790851 12527025 0.10 1910.75 3379714.27 0.27 + rcu_node_0: 29203 30064 0.16 786.55 1555573.00 51.74 88963 244254 0.00 398.87 428872.51 1.76 + +Clear the statistics:: + + # echo 0 > /proc/lock_stat diff --git a/Documentation/locking/lockstat.txt b/Documentation/locking/lockstat.txt deleted file mode 100644 index fdbeb0c45ef3..000000000000 --- a/Documentation/locking/lockstat.txt +++ /dev/null @@ -1,183 +0,0 @@ - -LOCK STATISTICS - -- WHAT - -As the name suggests, it provides statistics on locks. - -- WHY - -Because things like lock contention can severely impact performance. - -- HOW - -Lockdep already has hooks in the lock functions and maps lock instances to -lock classes. We build on that (see Documentation/locking/lockdep-design.txt). -The graph below shows the relation between the lock functions and the various -hooks therein. - - __acquire - | - lock _____ - | \ - | __contended - | | - | - | _______/ - |/ - | - __acquired - | - . - - . 
- | - __release - | - unlock - -lock, unlock - the regular lock functions -__* - the hooks -<> - states - -With these hooks we provide the following statistics: - - con-bounces - number of lock contention that involved x-cpu data - contentions - number of lock acquisitions that had to wait - wait time min - shortest (non-0) time we ever had to wait for a lock - max - longest time we ever had to wait for a lock - total - total time we spend waiting on this lock - avg - average time spent waiting on this lock - acq-bounces - number of lock acquisitions that involved x-cpu data - acquisitions - number of times we took the lock - hold time min - shortest (non-0) time we ever held the lock - max - longest time we ever held the lock - total - total time this lock was held - avg - average time this lock was held - -These numbers are gathered per lock class, per read/write state (when -applicable). - -It also tracks 4 contention points per class. A contention point is a call site -that had to wait on lock acquisition. - - - CONFIGURATION - -Lock statistics are enabled via CONFIG_LOCK_STAT. 
- - - USAGE - -Enable collection of statistics: - -# echo 1 >/proc/sys/kernel/lock_stat - -Disable collection of statistics: - -# echo 0 >/proc/sys/kernel/lock_stat - -Look at the current lock statistics: - -( line numbers not part of actual output, done for clarity in the explanation - below ) - -# less /proc/lock_stat - -01 lock_stat version 0.4 -02----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -03 class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg -04----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -05 -06 &mm->mmap_sem-W: 46 84 0.26 939.10 16371.53 194.90 47291 2922365 0.16 2220301.69 17464026916.32 5975.99 -07 &mm->mmap_sem-R: 37 100 1.31 299502.61 325629.52 3256.30 212344 34316685 0.10 7744.91 95016910.20 2.77 -08 --------------- -09 &mm->mmap_sem 1 [] khugepaged_scan_mm_slot+0x57/0x280 -10 &mm->mmap_sem 96 [] __do_page_fault+0x1d4/0x510 -11 &mm->mmap_sem 34 [] vm_mmap_pgoff+0x87/0xd0 -12 &mm->mmap_sem 17 [] vm_munmap+0x41/0x80 -13 --------------- -14 &mm->mmap_sem 1 [] dup_mmap+0x2a/0x3f0 -15 &mm->mmap_sem 60 [] SyS_mprotect+0xe9/0x250 -16 &mm->mmap_sem 41 [] __do_page_fault+0x1d4/0x510 -17 &mm->mmap_sem 68 [] vm_mmap_pgoff+0x87/0xd0 -18 -19............................................................................................................................................................................................................................. 
-20 -21 unix_table_lock: 110 112 0.21 49.24 163.91 1.46 21094 66312 0.12 624.42 31589.81 0.48 -22 --------------- -23 unix_table_lock 45 [] unix_create1+0x16e/0x1b0 -24 unix_table_lock 47 [] unix_release_sock+0x31/0x250 -25 unix_table_lock 15 [] unix_find_other+0x117/0x230 -26 unix_table_lock 5 [] unix_autobind+0x11f/0x1b0 -27 --------------- -28 unix_table_lock 39 [] unix_release_sock+0x31/0x250 -29 unix_table_lock 49 [] unix_create1+0x16e/0x1b0 -30 unix_table_lock 20 [] unix_find_other+0x117/0x230 -31 unix_table_lock 4 [] unix_autobind+0x11f/0x1b0 - - -This excerpt shows the first two lock class statistics. Line 01 shows the -output version - each time the format changes this will be updated. Line 02-04 -show the header with column descriptions. Lines 05-18 and 20-31 show the actual -statistics. These statistics come in two parts; the actual stats separated by a -short separator (line 08, 13) from the contention points. - -Lines 09-12 show the first 4 recorded contention points (the code -which tries to get the lock) and lines 14-17 show the first 4 recorded -contended points (the lock holder). It is possible that the max -con-bounces point is missing in the statistics. - -The first lock (05-18) is a read/write lock, and shows two lines above the -short separator. The contention points don't match the column descriptors, -they have two: contentions and [] symbol. The second set of contention -points are the points we're contending with. - -The integer part of the time values is in us. - -Dealing with nested locks, subclasses may appear: - -32........................................................................................................................................................................................................................... 
-33 -34 &rq->lock: 13128 13128 0.43 190.53 103881.26 7.91 97454 3453404 0.00 401.11 13224683.11 3.82 -35 --------- -36 &rq->lock 645 [] task_rq_lock+0x43/0x75 -37 &rq->lock 297 [] try_to_wake_up+0x127/0x25a -38 &rq->lock 360 [] select_task_rq_fair+0x1f0/0x74a -39 &rq->lock 428 [] scheduler_tick+0x46/0x1fb -40 --------- -41 &rq->lock 77 [] task_rq_lock+0x43/0x75 -42 &rq->lock 174 [] try_to_wake_up+0x127/0x25a -43 &rq->lock 4715 [] double_rq_lock+0x42/0x54 -44 &rq->lock 893 [] schedule+0x157/0x7b8 -45 -46........................................................................................................................................................................................................................... -47 -48 &rq->lock/1: 1526 11488 0.33 388.73 136294.31 11.86 21461 38404 0.00 37.93 109388.53 2.84 -49 ----------- -50 &rq->lock/1 11526 [] double_rq_lock+0x4f/0x54 -51 ----------- -52 &rq->lock/1 5645 [] double_rq_lock+0x42/0x54 -53 &rq->lock/1 1224 [] schedule+0x157/0x7b8 -54 &rq->lock/1 4336 [] double_rq_lock+0x4f/0x54 -55 &rq->lock/1 181 [] try_to_wake_up+0x127/0x25a - -Line 48 shows statistics for the second subclass (/1) of &rq->lock class -(subclass starts from 0), since in this case, as line 50 suggests, -double_rq_lock actually acquires a nested lock of two spinlocks. 
- -View the top contending locks: - -# grep : /proc/lock_stat | head - clockevents_lock: 2926159 2947636 0.15 46882.81 1784540466.34 605.41 3381345 3879161 0.00 2260.97 53178395.68 13.71 - tick_broadcast_lock: 346460 346717 0.18 2257.43 39364622.71 113.54 3642919 4242696 0.00 2263.79 49173646.60 11.59 - &mapping->i_mmap_mutex: 203896 203899 3.36 645530.05 31767507988.39 155800.21 3361776 8893984 0.17 2254.15 14110121.02 1.59 - &rq->lock: 135014 136909 0.18 606.09 842160.68 6.15 1540728 10436146 0.00 728.72 17606683.41 1.69 - &(&zone->lru_lock)->rlock: 93000 94934 0.16 59.18 188253.78 1.98 1199912 3809894 0.15 391.40 3559518.81 0.93 - tasklist_lock-W: 40667 41130 0.23 1189.42 428980.51 10.43 270278 510106 0.16 653.51 3939674.91 7.72 - tasklist_lock-R: 21298 21305 0.20 1310.05 215511.12 10.12 186204 241258 0.14 1162.33 1179779.23 4.89 - rcu_node_1: 47656 49022 0.16 635.41 193616.41 3.95 844888 1865423 0.00 764.26 1656226.96 0.89 - &(&dentry->d_lockref.lock)->rlock: 39791 40179 0.15 1302.08 88851.96 2.21 2790851 12527025 0.10 1910.75 3379714.27 0.27 - rcu_node_0: 29203 30064 0.16 786.55 1555573.00 51.74 88963 244254 0.00 398.87 428872.51 1.76 - -Clear the statistics: - -# echo 0 > /proc/lock_stat diff --git a/Documentation/locking/locktorture.rst b/Documentation/locking/locktorture.rst new file mode 100644 index 000000000000..e79eeeca3ac6 --- /dev/null +++ b/Documentation/locking/locktorture.rst @@ -0,0 +1,170 @@ +================================== +Kernel Lock Torture Test Operation +================================== + +CONFIG_LOCK_TORTURE_TEST +======================== + +The CONFIG_LOCK_TORTURE_TEST config option provides a kernel module +that runs torture tests on core kernel locking primitives. The kernel +module, 'locktorture', may be built after the fact on the running +kernel to be tested, if desired. The tests periodically output status +messages via printk(), which can be examined via dmesg (perhaps +grepping for "torture").
The test is started when the module is loaded, +and stops when the module is unloaded. This program is based on how RCU +is tortured, via rcutorture. + +This torture test consists of creating a number of kernel threads which +acquire the lock and hold it for a specific amount of time, thus simulating +different critical region behaviors. The amount of contention on the lock +can be increased by enlarging this critical region hold time and/or +creating more kthreads. + + +Module Parameters +================= + +This module has the following parameters: + + +Locktorture-specific +-------------------- + +nwriters_stress + Number of kernel threads that will stress exclusive lock + ownership (writers). The default value is twice the number + of online CPUs. + +nreaders_stress + Number of kernel threads that will stress shared lock + ownership (readers). The default is the same as the number + of writer threads. If the user did not specify nwriters_stress, + then both readers and writers will be the number of online CPUs. + +torture_type + Type of lock to torture. By default, only spinlocks will + be tortured. This module can torture the following locks, + with string values as follows: + + - "lock_busted": + Simulates a buggy lock implementation. + + - "spin_lock": + spin_lock() and spin_unlock() pairs. + + - "spin_lock_irq": + spin_lock_irq() and spin_unlock_irq() pairs. + + - "rw_lock": + read/write lock() and unlock() rwlock pairs. + + - "rw_lock_irq": + read/write lock_irq() and unlock_irq() + rwlock pairs. + + - "mutex_lock": + mutex_lock() and mutex_unlock() pairs. + + - "rtmutex_lock": + rtmutex_lock() and rtmutex_unlock() pairs. + Kernel must have CONFIG_RT_MUTEX=y. + + - "rwsem_lock": + read/write down() and up() semaphore pairs. + + +Torture-framework (RCU + locking) +--------------------------------- + +shutdown_secs + The number of seconds to run the test before terminating + the test and powering off the system.
The default is + zero, which disables test termination and system shutdown. + This capability is useful for automated testing. + +onoff_interval + The number of seconds between each attempt to execute a + randomly selected CPU-hotplug operation. Defaults + to zero, which disables CPU hotplugging. In + CONFIG_HOTPLUG_CPU=n kernels, locktorture will silently + refuse to do any CPU-hotplug operations regardless of + what value is specified for onoff_interval. + +onoff_holdoff + The number of seconds to wait until starting CPU-hotplug + operations. This would normally only be used when + locktorture was built into the kernel and started + automatically at boot time, in which case it is useful + in order to avoid confusing boot-time code with CPUs + coming and going. This parameter is only useful if + CONFIG_HOTPLUG_CPU is enabled. + +stat_interval + Number of seconds between statistics-related printk()s. + By default, locktorture will report stats every 60 seconds. + Setting the interval to zero causes the statistics to + be printed -only- when the module is unloaded. + +stutter + The length of time to run the test before pausing for this + same period of time. Defaults to "stutter=5", so as + to run and pause for (roughly) five-second intervals. + Specifying "stutter=0" causes the test to run continuously + without pausing, which is the old default behavior. + +shuffle_interval + The number of seconds to keep the test threads affinitied + to a particular subset of the CPUs. Defaults to 3 seconds. + Used in conjunction with test_no_idle_hz. + +verbose + Enable verbose debugging printing, via printk(). Enabled + by default. This extra information is mostly related to + high-level errors and reports from the main 'torture' + framework.
+ + +Statistics +========== + +Statistics are printed in the following format:: + + spin_lock-torture: Writes: Total: 93746064 Max/Min: 0/0 Fail: 0 + (A) (B) (C) (D) (E) + + (A): Lock type that is being tortured -- torture_type parameter. + + (B): Number of writer lock acquisitions. If dealing with a read/write + primitive a second "Reads" statistics line is printed. + + (C): Number of times the lock was acquired. + + (D): Min and max number of times threads failed to acquire the lock. + + (E): true/false values if there were errors acquiring the lock. This should + -only- be positive if there is a bug in the locking primitive's + implementation. Otherwise a lock should never fail (i.e., spin_lock()). + Of course, the same applies for (C), above. A dummy example of this is + the "lock_busted" type. + +Usage +===== + +The following script may be used to torture locks:: + + #!/bin/sh + + modprobe locktorture + sleep 3600 + rmmod locktorture + dmesg | grep torture: + +The output can be manually inspected for the error flag of "!!!". +One could of course create a more elaborate script that automatically +checked for such errors. The "rmmod" command forces a "SUCCESS", +"FAILURE", or "RCU_HOTPLUG" indication to be printk()ed. The first +two are self-explanatory, while the last indicates that while there +were no locking failures, CPU-hotplug problems were detected. + +Also see: Documentation/RCU/torture.txt diff --git a/Documentation/locking/locktorture.txt b/Documentation/locking/locktorture.txt deleted file mode 100644 index 6a8df4cd19bf..000000000000 --- a/Documentation/locking/locktorture.txt +++ /dev/null @@ -1,145 +0,0 @@ -Kernel Lock Torture Test Operation - -CONFIG_LOCK_TORTURE_TEST - -The CONFIG LOCK_TORTURE_TEST config option provides a kernel module -that runs torture tests on core kernel locking primitives. The kernel -module, 'locktorture', may be built after the fact on the running -kernel to be tested, if desired. 
The tests periodically output status -messages via printk(), which can be examined via the dmesg (perhaps -grepping for "torture"). The test is started when the module is loaded, -and stops when the module is unloaded. This program is based on how RCU -is tortured, via rcutorture. - -This torture test consists of creating a number of kernel threads which -acquire the lock and hold it for specific amount of time, thus simulating -different critical region behaviors. The amount of contention on the lock -can be simulated by either enlarging this critical region hold time and/or -creating more kthreads. - - -MODULE PARAMETERS - -This module has the following parameters: - - - ** Locktorture-specific ** - -nwriters_stress Number of kernel threads that will stress exclusive lock - ownership (writers). The default value is twice the number - of online CPUs. - -nreaders_stress Number of kernel threads that will stress shared lock - ownership (readers). The default is the same amount of writer - locks. If the user did not specify nwriters_stress, then - both readers and writers be the amount of online CPUs. - -torture_type Type of lock to torture. By default, only spinlocks will - be tortured. This module can torture the following locks, - with string values as follows: - - o "lock_busted": Simulates a buggy lock implementation. - - o "spin_lock": spin_lock() and spin_unlock() pairs. - - o "spin_lock_irq": spin_lock_irq() and spin_unlock_irq() - pairs. - - o "rw_lock": read/write lock() and unlock() rwlock pairs. - - o "rw_lock_irq": read/write lock_irq() and unlock_irq() - rwlock pairs. - - o "mutex_lock": mutex_lock() and mutex_unlock() pairs. - - o "rtmutex_lock": rtmutex_lock() and rtmutex_unlock() - pairs. Kernel must have CONFIG_RT_MUTEX=y. - - o "rwsem_lock": read/write down() and up() semaphore pairs. - - - ** Torture-framework (RCU + locking) ** - -shutdown_secs The number of seconds to run the test before terminating - the test and powering off the system. 
The default is - zero, which disables test termination and system shutdown. - This capability is useful for automated testing. - -onoff_interval The number of seconds between each attempt to execute a - randomly selected CPU-hotplug operation. Defaults - to zero, which disables CPU hotplugging. In - CONFIG_HOTPLUG_CPU=n kernels, locktorture will silently - refuse to do any CPU-hotplug operations regardless of - what value is specified for onoff_interval. - -onoff_holdoff The number of seconds to wait until starting CPU-hotplug - operations. This would normally only be used when - locktorture was built into the kernel and started - automatically at boot time, in which case it is useful - in order to avoid confusing boot-time code with CPUs - coming and going. This parameter is only useful if - CONFIG_HOTPLUG_CPU is enabled. - -stat_interval Number of seconds between statistics-related printk()s. - By default, locktorture will report stats every 60 seconds. - Setting the interval to zero causes the statistics to - be printed -only- when the module is unloaded, and this - is the default. - -stutter The length of time to run the test before pausing for this - same period of time. Defaults to "stutter=5", so as - to run and pause for (roughly) five-second intervals. - Specifying "stutter=0" causes the test to run continuously - without pausing, which is the old default behavior. - -shuffle_interval The number of seconds to keep the test threads affinitied - to a particular subset of the CPUs, defaults to 3 seconds. - Used in conjunction with test_no_idle_hz. - -verbose Enable verbose debugging printing, via printk(). Enabled - by default. This extra information is mostly related to - high-level errors and reports from the main 'torture' - framework. - - -STATISTICS - -Statistics are printed in the following format: - -spin_lock-torture: Writes: Total: 93746064 Max/Min: 0/0 Fail: 0 - (A) (B) (C) (D) (E) - -(A): Lock type that is being tortured -- torture_type parameter. 
- -(B): Number of writer lock acquisitions. If dealing with a read/write primitive - a second "Reads" statistics line is printed. - -(C): Number of times the lock was acquired. - -(D): Min and max number of times threads failed to acquire the lock. - -(E): true/false values if there were errors acquiring the lock. This should - -only- be positive if there is a bug in the locking primitive's - implementation. Otherwise a lock should never fail (i.e., spin_lock()). - Of course, the same applies for (C), above. A dummy example of this is - the "lock_busted" type. - -USAGE - -The following script may be used to torture locks: - - #!/bin/sh - - modprobe locktorture - sleep 3600 - rmmod locktorture - dmesg | grep torture: - -The output can be manually inspected for the error flag of "!!!". -One could of course create a more elaborate script that automatically -checked for such errors. The "rmmod" command forces a "SUCCESS", -"FAILURE", or "RCU_HOTPLUG" indication to be printk()ed. The first -two are self-explanatory, while the last indicates that while there -were no locking failures, CPU-hotplug problems were detected. - -Also see: Documentation/RCU/torture.txt diff --git a/Documentation/locking/mutex-design.rst b/Documentation/locking/mutex-design.rst new file mode 100644 index 000000000000..4d8236b81fa5 --- /dev/null +++ b/Documentation/locking/mutex-design.rst @@ -0,0 +1,152 @@ +======================= +Generic Mutex Subsystem +======================= + +started by Ingo Molnar + +updated by Davidlohr Bueso + +What are mutexes? +----------------- + +In the Linux kernel, mutexes refer to a particular locking primitive +that enforces serialization on shared memory systems, and not only to +the generic term referring to 'mutual exclusion' found in academia +or similar theoretical text books. Mutexes are sleeping locks which +behave similarly to binary semaphores, and were introduced in 2006[1] +as an alternative to these. 
This new data structure provided a number +of advantages, including simpler interfaces, and at that time smaller +code (see Disadvantages). + +[1] http://lwn.net/Articles/164802/ + +Implementation +-------------- + +Mutexes are represented by 'struct mutex', defined in include/linux/mutex.h +and implemented in kernel/locking/mutex.c. These locks use an atomic variable +(->owner) to keep track of the lock state during its lifetime. Field owner +actually contains `struct task_struct *` to the current lock owner and it is +therefore NULL if not currently owned. Since task_struct pointers are aligned +at at least L1_CACHE_BYTES, low bits (3) are used to store extra state (e.g., +if waiter list is non-empty). In its most basic form it also includes a +wait-queue and a spinlock that serializes access to it. Furthermore, +CONFIG_MUTEX_SPIN_ON_OWNER=y systems use a spinner MCS lock (->osq), described +below in (ii). + +When acquiring a mutex, there are three possible paths that can be +taken, depending on the state of the lock: + +(i) fastpath: tries to atomically acquire the lock by cmpxchg()ing the owner with + the current task. This only works in the uncontended case (cmpxchg() checks + against 0UL, so all 3 state bits above have to be 0). If the lock is + contended it goes to the next possible path. + +(ii) midpath: aka optimistic spinning, tries to spin for acquisition + while the lock owner is running and there are no other tasks ready + to run that have higher priority (need_resched). The rationale is + that if the lock owner is running, it is likely to release the lock + soon. The mutex spinners are queued up using MCS lock so that only + one spinner can compete for the mutex. + + The MCS lock (proposed by Mellor-Crummey and Scott) is a simple spinlock + with the desirable properties of being fair and with each cpu trying + to acquire the lock spinning on a local variable. 
It avoids expensive + cacheline bouncing that common test-and-set spinlock implementations + incur. An MCS-like lock is specially tailored for optimistic spinning + for sleeping lock implementation. An important feature of the customized + MCS lock is that it has the extra property that spinners are able to exit + the MCS spinlock queue when they need to reschedule. This further helps + avoid situations where MCS spinners that need to reschedule would continue + waiting to spin on mutex owner, only to go directly to slowpath upon + obtaining the MCS lock. + + +(iii) slowpath: last resort, if the lock is still unable to be acquired, + the task is added to the wait-queue and sleeps until woken up by the + unlock path. Under normal circumstances it blocks as TASK_UNINTERRUPTIBLE. + +While formally kernel mutexes are sleepable locks, it is path (ii) that +makes them more practically a hybrid type. By simply not interrupting a +task and busy-waiting for a few cycles instead of immediately sleeping, +the performance of this lock has been seen to significantly improve a +number of workloads. Note that this technique is also used for rw-semaphores. + +Semantics +--------- + +The mutex subsystem checks and enforces the following rules: + + - Only one task can hold the mutex at a time. + - Only the owner can unlock the mutex. + - Multiple unlocks are not permitted. + - Recursive locking/unlocking is not permitted. + - A mutex must only be initialized via the API (see below). + - A task may not exit with a mutex held. + - Memory areas where held locks reside must not be freed. + - Held mutexes must not be reinitialized. + - Mutexes may not be used in hardware or software interrupt + contexts such as tasklets and timers. + +These semantics are fully enforced when CONFIG DEBUG_MUTEXES is enabled. 
+In addition, the mutex debugging code also implements a number of other +features that make lock debugging easier and faster: + + - Uses symbolic names of mutexes, whenever they are printed + in debug output. + - Point-of-acquire tracking, symbolic lookup of function names, + list of all locks held in the system, printout of them. + - Owner tracking. + - Detects self-recursing locks and prints out all relevant info. + - Detects multi-task circular deadlocks and prints out all affected + locks and tasks (and only those tasks). + + +Interfaces +---------- +Statically define the mutex:: + + DEFINE_MUTEX(name); + +Dynamically initialize the mutex:: + + mutex_init(mutex); + +Acquire the mutex, uninterruptible:: + + void mutex_lock(struct mutex *lock); + void mutex_lock_nested(struct mutex *lock, unsigned int subclass); + int mutex_trylock(struct mutex *lock); + +Acquire the mutex, interruptible:: + + int mutex_lock_interruptible_nested(struct mutex *lock, + unsigned int subclass); + int mutex_lock_interruptible(struct mutex *lock); + +Acquire the mutex, interruptible, if dec to 0:: + + int atomic_dec_and_mutex_lock(atomic_t *cnt, struct mutex *lock); + +Unlock the mutex:: + + void mutex_unlock(struct mutex *lock); + +Test if the mutex is taken:: + + int mutex_is_locked(struct mutex *lock); + +Disadvantages +------------- + +Unlike its original design and purpose, 'struct mutex' is among the largest +locks in the kernel. E.g: on x86-64 it is 32 bytes, where 'struct semaphore' +is 24 bytes and rw_semaphore is 40 bytes. Larger structure sizes mean more CPU +cache and memory footprint. + +When to use mutexes +------------------- + +Unless the strict semantics of mutexes are unsuitable and/or the critical +region prevents the lock from being shared, always prefer them to any other +locking primitive. 
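Note on the mutex-design.rst text added above: the fastpath described in (i) can be modelled in user space roughly as follows. This is only a sketch under stated assumptions, not the code in kernel/locking/mutex.c. It uses C11 <stdatomic.h> in place of the kernel's atomic_long_t and cmpxchg(), and the names toy_task, toy_mutex, toy_trylock_fast, toy_unlock_fast, toy_owner, TOY_FLAG_MASK and TOY_FLAG_WAITERS are hypothetical. The unlock helper is an assumption added for symmetry; the text above only describes the lock fastpath. The point is to show why a compare against 0UL fails whenever there is an owner or any of the three low state bits is set.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space model of the owner word described above:
 * the upper bits hold the owner's address, the low 3 bits hold state. */
#define TOY_FLAG_MASK    ((uintptr_t)0x07)
#define TOY_FLAG_WAITERS ((uintptr_t)0x01)  /* "waiter list is non-empty" */

struct toy_task {
	/* Aligned so the low 3 bits of the struct's address are always zero. */
	_Alignas(8) int id;
};

struct toy_mutex {
	atomic_uintptr_t owner;  /* 0 means unlocked with no state bits set */
};

/* Lock fastpath: succeeds only if the whole word is 0, i.e. there is
 * no owner and all state bits are clear. */
static bool toy_trylock_fast(struct toy_mutex *m, struct toy_task *self)
{
	uintptr_t expected = 0;

	return atomic_compare_exchange_strong_explicit(
		&m->owner, &expected, (uintptr_t)self,
		memory_order_acquire, memory_order_relaxed);
}

/* Unlock fastpath (assumed for this sketch): succeeds only if the word
 * is exactly our task pointer, so any state bit forces a slowpath. */
static bool toy_unlock_fast(struct toy_mutex *m, struct toy_task *self)
{
	uintptr_t expected = (uintptr_t)self;

	return atomic_compare_exchange_strong_explicit(
		&m->owner, &expected, (uintptr_t)0,
		memory_order_release, memory_order_relaxed);
}

/* Recover the owner pointer by masking off the state bits. */
static struct toy_task *toy_owner(struct toy_mutex *m)
{
	uintptr_t v = atomic_load_explicit(&m->owner, memory_order_relaxed);

	return (struct toy_task *)(v & ~TOY_FLAG_MASK);
}
```

In this model a second caller of toy_trylock_fast() fails while the first still owns the lock, and setting TOY_FLAG_WAITERS in the owner word makes toy_unlock_fast() fail even for the owner, which is where a real implementation would fall through to its slower paths.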
diff --git a/Documentation/locking/mutex-design.txt b/Documentation/locking/mutex-design.txt deleted file mode 100644 index 818aca19612f..000000000000 --- a/Documentation/locking/mutex-design.txt +++ /dev/null @@ -1,142 +0,0 @@ -Generic Mutex Subsystem - -started by Ingo Molnar -updated by Davidlohr Bueso - -What are mutexes? ------------------ - -In the Linux kernel, mutexes refer to a particular locking primitive -that enforces serialization on shared memory systems, and not only to -the generic term referring to 'mutual exclusion' found in academia -or similar theoretical text books. Mutexes are sleeping locks which -behave similarly to binary semaphores, and were introduced in 2006[1] -as an alternative to these. This new data structure provided a number -of advantages, including simpler interfaces, and at that time smaller -code (see Disadvantages). - -[1] http://lwn.net/Articles/164802/ - -Implementation --------------- - -Mutexes are represented by 'struct mutex', defined in include/linux/mutex.h -and implemented in kernel/locking/mutex.c. These locks use an atomic variable -(->owner) to keep track of the lock state during its lifetime. Field owner -actually contains 'struct task_struct *' to the current lock owner and it is -therefore NULL if not currently owned. Since task_struct pointers are aligned -at at least L1_CACHE_BYTES, low bits (3) are used to store extra state (e.g., -if waiter list is non-empty). In its most basic form it also includes a -wait-queue and a spinlock that serializes access to it. Furthermore, -CONFIG_MUTEX_SPIN_ON_OWNER=y systems use a spinner MCS lock (->osq), described -below in (ii). - -When acquiring a mutex, there are three possible paths that can be -taken, depending on the state of the lock: - -(i) fastpath: tries to atomically acquire the lock by cmpxchg()ing the owner with - the current task. This only works in the uncontended case (cmpxchg() checks - against 0UL, so all 3 state bits above have to be 0). 
If the lock is - contended it goes to the next possible path. - -(ii) midpath: aka optimistic spinning, tries to spin for acquisition - while the lock owner is running and there are no other tasks ready - to run that have higher priority (need_resched). The rationale is - that if the lock owner is running, it is likely to release the lock - soon. The mutex spinners are queued up using MCS lock so that only - one spinner can compete for the mutex. - - The MCS lock (proposed by Mellor-Crummey and Scott) is a simple spinlock - with the desirable properties of being fair and with each cpu trying - to acquire the lock spinning on a local variable. It avoids expensive - cacheline bouncing that common test-and-set spinlock implementations - incur. An MCS-like lock is specially tailored for optimistic spinning - for sleeping lock implementation. An important feature of the customized - MCS lock is that it has the extra property that spinners are able to exit - the MCS spinlock queue when they need to reschedule. This further helps - avoid situations where MCS spinners that need to reschedule would continue - waiting to spin on mutex owner, only to go directly to slowpath upon - obtaining the MCS lock. - - -(iii) slowpath: last resort, if the lock is still unable to be acquired, - the task is added to the wait-queue and sleeps until woken up by the - unlock path. Under normal circumstances it blocks as TASK_UNINTERRUPTIBLE. - -While formally kernel mutexes are sleepable locks, it is path (ii) that -makes them more practically a hybrid type. By simply not interrupting a -task and busy-waiting for a few cycles instead of immediately sleeping, -the performance of this lock has been seen to significantly improve a -number of workloads. Note that this technique is also used for rw-semaphores. - -Semantics ---------- - -The mutex subsystem checks and enforces the following rules: - - - Only one task can hold the mutex at a time. - - Only the owner can unlock the mutex. 
- - Multiple unlocks are not permitted. - - Recursive locking/unlocking is not permitted. - - A mutex must only be initialized via the API (see below). - - A task may not exit with a mutex held. - - Memory areas where held locks reside must not be freed. - - Held mutexes must not be reinitialized. - - Mutexes may not be used in hardware or software interrupt - contexts such as tasklets and timers. - -These semantics are fully enforced when CONFIG DEBUG_MUTEXES is enabled. -In addition, the mutex debugging code also implements a number of other -features that make lock debugging easier and faster: - - - Uses symbolic names of mutexes, whenever they are printed - in debug output. - - Point-of-acquire tracking, symbolic lookup of function names, - list of all locks held in the system, printout of them. - - Owner tracking. - - Detects self-recursing locks and prints out all relevant info. - - Detects multi-task circular deadlocks and prints out all affected - locks and tasks (and only those tasks). - - -Interfaces ----------- -Statically define the mutex: - DEFINE_MUTEX(name); - -Dynamically initialize the mutex: - mutex_init(mutex); - -Acquire the mutex, uninterruptible: - void mutex_lock(struct mutex *lock); - void mutex_lock_nested(struct mutex *lock, unsigned int subclass); - int mutex_trylock(struct mutex *lock); - -Acquire the mutex, interruptible: - int mutex_lock_interruptible_nested(struct mutex *lock, - unsigned int subclass); - int mutex_lock_interruptible(struct mutex *lock); - -Acquire the mutex, interruptible, if dec to 0: - int atomic_dec_and_mutex_lock(atomic_t *cnt, struct mutex *lock); - -Unlock the mutex: - void mutex_unlock(struct mutex *lock); - -Test if the mutex is taken: - int mutex_is_locked(struct mutex *lock); - -Disadvantages -------------- - -Unlike its original design and purpose, 'struct mutex' is among the largest -locks in the kernel. E.g: on x86-64 it is 32 bytes, where 'struct semaphore' -is 24 bytes and rw_semaphore is 40 bytes. 
Larger structure sizes mean more CPU -cache and memory footprint. - -When to use mutexes -------------------- - -Unless the strict semantics of mutexes are unsuitable and/or the critical -region prevents the lock from being shared, always prefer them to any other -locking primitive. diff --git a/Documentation/locking/rt-mutex-design.rst b/Documentation/locking/rt-mutex-design.rst new file mode 100644 index 000000000000..59c2a64efb21 --- /dev/null +++ b/Documentation/locking/rt-mutex-design.rst @@ -0,0 +1,574 @@ +============================== +RT-mutex implementation design +============================== + +Copyright (c) 2006 Steven Rostedt + +Licensed under the GNU Free Documentation License, Version 1.2 + + +This document tries to describe the design of the rtmutex.c implementation. +It doesn't describe the reasons why rtmutex.c exists. For that please see +Documentation/locking/rt-mutex.rst. Although this document does explain problems +that happen without this code, but that is in the concept to understand +what the code actually is doing. + +The goal of this document is to help others understand the priority +inheritance (PI) algorithm that is used, as well as reasons for the +decisions that were made to implement PI in the manner that was done. + + +Unbounded Priority Inversion +---------------------------- + +Priority inversion is when a lower priority process executes while a higher +priority process wants to run. This happens for several reasons, and +most of the time it can't be helped. Anytime a high priority process wants +to use a resource that a lower priority process has (a mutex for example), +the high priority process must wait until the lower priority process is done +with the resource. This is a priority inversion. What we want to prevent +is something called unbounded priority inversion. That is when the high +priority process is prevented from running by a lower priority process for +an undetermined amount of time. 
+ +The classic example of unbounded priority inversion is where you have three +processes, let's call them processes A, B, and C, where A is the highest +priority process, C is the lowest, and B is in between. A tries to grab a lock +that C owns and must wait and lets C run to release the lock. But in the +meantime, B executes, and since B is of a higher priority than C, it preempts C, +but by doing so, it is in fact preempting A which is a higher priority process. +Now there's no way of knowing how long A will be sleeping waiting for C +to release the lock, because for all we know, B is a CPU hog and will +never give C a chance to release the lock. This is called unbounded priority +inversion. + +Here's a little ASCII art to show the problem:: + + grab lock L1 (owned by C) + | + A ---+ + C preempted by B + | + C +----+ + + B +--------> + B now keeps A from running. + + +Priority Inheritance (PI) +------------------------- + +There are several ways to solve this issue, but other ways are out of scope +for this document. Here we only discuss PI. + +PI is where a process inherits the priority of another process if the other +process blocks on a lock owned by the current process. To make this easier +to understand, let's use the previous example, with processes A, B, and C again. + +This time, when A blocks on the lock owned by C, C would inherit the priority +of A. So now if B becomes runnable, it would not preempt C, since C now has +the high priority of A. As soon as C releases the lock, it loses its +inherited priority, and A then can continue with the resource that C had. + +Terminology +----------- + +Here I explain some terminology that is used in this document to help describe +the design that is used to implement PI. + +PI chain + - The PI chain is an ordered series of locks and processes that cause + processes to inherit priorities from a previous process that is + blocked on one of its locks. This is described in more detail + later in this document. 
+ +mutex + - In this document, to differentiate from locks that implement + PI and spin locks that are used in the PI code, from now on + the PI locks will be called a mutex. + +lock + - In this document from now on, I will use the term lock when + referring to spin locks that are used to protect parts of the PI + algorithm. These locks disable preemption for UP (when + CONFIG_PREEMPT is enabled) and on SMP prevents multiple CPUs from + entering critical sections simultaneously. + +spin lock + - Same as lock above. + +waiter + - A waiter is a struct that is stored on the stack of a blocked + process. Since the scope of the waiter is within the code for + a process being blocked on the mutex, it is fine to allocate + the waiter on the process's stack (local variable). This + structure holds a pointer to the task, as well as the mutex that + the task is blocked on. It also has rbtree node structures to + place the task in the waiters rbtree of a mutex as well as the + pi_waiters rbtree of a mutex owner task (described below). + + waiter is sometimes used in reference to the task that is waiting + on a mutex. This is the same as waiter->task. + +waiters + - A list of processes that are blocked on a mutex. + +top waiter + - The highest priority process waiting on a specific mutex. + +top pi waiter + - The highest priority process waiting on one of the mutexes + that a specific process owns. + +Note: + task and process are used interchangeably in this document, mostly to + differentiate between two processes that are being described together. + + +PI chain +-------- + +The PI chain is a list of processes and mutexes that may cause priority +inheritance to take place. Multiple chains may converge, but a chain +would never diverge, since a process can't be blocked on more than one +mutex at a time. 
+ +Example:: + + Process: A, B, C, D, E + Mutexes: L1, L2, L3, L4 + + A owns: L1 + B blocked on L1 + B owns L2 + C blocked on L2 + C owns L3 + D blocked on L3 + D owns L4 + E blocked on L4 + +The chain would be:: + + E->L4->D->L3->C->L2->B->L1->A + +To show where two chains merge, we could add another process F and +another mutex L5 where B owns L5 and F is blocked on mutex L5. + +The chain for F would be:: + + F->L5->B->L1->A + +Since a process may own more than one mutex, but never be blocked on more than +one, the chains merge. + +Here we show both chains:: + + E->L4->D->L3->C->L2-+ + | + +->B->L1->A + | + F->L5-+ + +For PI to work, the processes at the right end of these chains (or we may +also call it the Top of the chain) must be equal to or higher in priority +than the processes to the left or below in the chain. + +Also since a mutex may have more than one process blocked on it, we can +have multiple chains merge at mutexes. If we add another process G that is +blocked on mutex L2:: + + G->L2->B->L1->A + +And once again, to show how this can grow I will show the merging chains +again:: + + E->L4->D->L3->C-+ + +->L2-+ + | | + G-+ +->B->L1->A + | + F->L5-+ + +If process G has the highest priority in the chain, then all the tasks up +the chain (A and B in this example), must have their priorities increased +to that of G. + +Mutex Waiters Tree +------------------ + +Every mutex keeps track of all the waiters that are blocked on itself. The +mutex has a rbtree to store these waiters by priority. This tree is protected +by a spin lock that is located in the struct of the mutex. This lock is called +wait_lock. + + +Task PI Tree +------------ + +To keep track of the PI chains, each process has its own PI rbtree. This is +a tree of all top waiters of the mutexes that are owned by the process. +Note that this tree only holds the top waiters and not all waiters that are +blocked on mutexes owned by the process. 
+ +The top of the task's PI tree is always the highest priority task that +is waiting on a mutex that is owned by the task. So if the task has +inherited a priority, it will always be the priority of the task that is +at the top of this tree. + +This tree is stored in the task structure of a process as a rbtree called +pi_waiters. It is protected by a spin lock also in the task structure, +called pi_lock. This lock may also be taken in interrupt context, so when +locking the pi_lock, interrupts must be disabled. + + +Depth of the PI Chain +--------------------- + +The maximum depth of the PI chain is not dynamic, and could actually be +defined. But is very complex to figure it out, since it depends on all +the nesting of mutexes. Let's look at the example where we have 3 mutexes, +L1, L2, and L3, and four separate functions func1, func2, func3 and func4. +The following shows a locking order of L1->L2->L3, but may not actually +be directly nested that way:: + + void func1(void) + { + mutex_lock(L1); + + /* do anything */ + + mutex_unlock(L1); + } + + void func2(void) + { + mutex_lock(L1); + mutex_lock(L2); + + /* do something */ + + mutex_unlock(L2); + mutex_unlock(L1); + } + + void func3(void) + { + mutex_lock(L2); + mutex_lock(L3); + + /* do something else */ + + mutex_unlock(L3); + mutex_unlock(L2); + } + + void func4(void) + { + mutex_lock(L3); + + /* do something again */ + + mutex_unlock(L3); + } + +Now we add 4 processes that run each of these functions separately. +Processes A, B, C, and D which run functions func1, func2, func3 and func4 +respectively, and such that D runs first and A last. With D being preempted +in func4 in the "do something again" area, we have a locking that follows:: + + D owns L3 + C blocked on L3 + C owns L2 + B blocked on L2 + B owns L1 + A blocked on L1 + + And thus we have the chain A->L1->B->L2->C->L3->D. 
+ +This gives us a PI depth of 4 (four processes), but looking at any of the +functions individually, it seems as though they only have at most a locking +depth of two. So, although the locking depth is defined at compile time, +it still is very difficult to find the possibilities of that depth. + +Now since mutexes can be defined by user-land applications, we don't want a DOS +type of application that nests large amounts of mutexes to create a large +PI chain, and have the code holding spin locks while looking at a large +amount of data. So to prevent this, the implementation not only implements +a maximum lock depth, but also only holds at most two different locks at a +time, as it walks the PI chain. More about this below. + + +Mutex owner and flags +--------------------- + +The mutex structure contains a pointer to the owner of the mutex. If the +mutex is not owned, this owner is set to NULL. Since all architectures +have the task structure on at least a two byte alignment (and if this is +not true, the rtmutex.c code will be broken!), this allows for the least +significant bit to be used as a flag. Bit 0 is used as the "Has Waiters" +flag. It's set whenever there are waiters on a mutex. + +See Documentation/locking/rt-mutex.rst for further details. + +cmpxchg Tricks +-------------- + +Some architectures implement an atomic cmpxchg (Compare and Exchange). This +is used (when applicable) to keep the fast path of grabbing and releasing +mutexes short. + +cmpxchg is basically the following function performed atomically:: + + unsigned long _cmpxchg(unsigned long *A, unsigned long *B, unsigned long *C) + { + unsigned long T = *A; + if (*A == *B) { + *A = *C; + } + return T; + } + #define cmpxchg(a,b,c) _cmpxchg(&a,&b,&c) + +This is really nice to have, since it allows you to only update a variable +if the variable is what you expect it to be. You know if it succeeded if +the return value (the old value of A) is equal to B. 
+ +The macro rt_mutex_cmpxchg is used to try to lock and unlock mutexes. If +the architecture does not support CMPXCHG, then this macro is simply set +to fail every time. But if CMPXCHG is supported, then this will +help out extremely to keep the fast path short. + +The use of rt_mutex_cmpxchg with the flags in the owner field help optimize +the system for architectures that support it. This will also be explained +later in this document. + + +Priority adjustments +-------------------- + +The implementation of the PI code in rtmutex.c has several places that a +process must adjust its priority. With the help of the pi_waiters of a +process this is rather easy to know what needs to be adjusted. + +The functions implementing the task adjustments are rt_mutex_adjust_prio +and rt_mutex_setprio. rt_mutex_setprio is only used in rt_mutex_adjust_prio. + +rt_mutex_adjust_prio examines the priority of the task, and the highest +priority process that is waiting any of mutexes owned by the task. Since +the pi_waiters of a task holds an order by priority of all the top waiters +of all the mutexes that the task owns, we simply need to compare the top +pi waiter to its own normal/deadline priority and take the higher one. +Then rt_mutex_setprio is called to adjust the priority of the task to the +new priority. Note that rt_mutex_setprio is defined in kernel/sched/core.c +to implement the actual change in priority. + +Note: + For the "prio" field in task_struct, the lower the number, the + higher the priority. A "prio" of 5 is of higher priority than a + "prio" of 10. + +It is interesting to note that rt_mutex_adjust_prio can either increase +or decrease the priority of the task. In the case that a higher priority +process has just blocked on a mutex owned by the task, rt_mutex_adjust_prio +would increase/boost the task's priority. 
But if a higher priority task +were for some reason to leave the mutex (timeout or signal), this same function +would decrease/unboost the priority of the task. That is because the pi_waiters +always contains the highest priority task that is waiting on a mutex owned +by the task, so we only need to compare the priority of that top pi waiter +to the normal priority of the given task. + + +High level overview of the PI chain walk +---------------------------------------- + +The PI chain walk is implemented by the function rt_mutex_adjust_prio_chain. + +The implementation has gone through several iterations, and has ended up +with what we believe is the best. It walks the PI chain by only grabbing +at most two locks at a time, and is very efficient. + +The rt_mutex_adjust_prio_chain can be used either to boost or lower process +priorities. + +rt_mutex_adjust_prio_chain is called with a task to be checked for PI +(de)boosting (the owner of a mutex that a process is blocking on), a flag to +check for deadlocking, the mutex that the task owns, a pointer to a waiter +that is the process's waiter struct that is blocked on the mutex (although this +parameter may be NULL for deboosting), a pointer to the mutex on which the task +is blocked, and a top_task as the top waiter of the mutex. + +For this explanation, I will not mention deadlock detection. This explanation +will try to stay at a high level. + +When this function is called, there are no locks held. That also means +that the state of the owner and lock can change when entered into this function. + +Before this function is called, the task has already had rt_mutex_adjust_prio +performed on it. This means that the task is set to the priority that it +should be at, but the rbtree nodes of the task's waiter have not been updated +with the new priorities, and this task may not be in the proper locations +in the pi_waiters and waiters trees that the task is blocked on. This function +solves all that. 
+ +The main operation of this function is summarized by Thomas Gleixner in +rtmutex.c. See the 'Chain walk basics and protection scope' comment for further +details. + +Taking of a mutex (The walk through) +------------------------------------ + +OK, now let's take a look at the detailed walk through of what happens when +taking a mutex. + +The first thing that is tried is the fast taking of the mutex. This is +done when we have CMPXCHG enabled (otherwise the fast taking automatically +fails). Only when the owner field of the mutex is NULL can the lock be +taken with the CMPXCHG and nothing else needs to be done. + +If there is contention on the lock, we go about the slow path +(rt_mutex_slowlock). + +The slow path function is where the task's waiter structure is created on +the stack. This is because the waiter structure is only needed for the +scope of this function. The waiter structure holds the nodes to store +the task on the waiters tree of the mutex, and if need be, the pi_waiters +tree of the owner. + +The wait_lock of the mutex is taken since the slow path of unlocking the +mutex also takes this lock. + +We then call try_to_take_rt_mutex. This is where the architecture that +does not implement CMPXCHG would always grab the lock (if there's no +contention). + +try_to_take_rt_mutex is used every time the task tries to grab a mutex in the +slow path. The first thing that is done here is an atomic setting of +the "Has Waiters" flag of the mutex's owner field. By setting this flag +now, the current owner of the mutex being contended for can't release the mutex +without going into the slow unlock path, and it would then need to grab the +wait_lock, which this code currently holds. So setting the "Has Waiters" flag +forces the current owner to synchronize with this code. 
+ +The lock is taken if the following are true: + + 1) The lock has no owner + 2) The current task is the highest priority against all other + waiters of the lock + +If the task succeeds to acquire the lock, then the task is set as the +owner of the lock, and if the lock still has waiters, the top_waiter +(highest priority task waiting on the lock) is added to this task's +pi_waiters tree. + +If the lock is not taken by try_to_take_rt_mutex(), then the +task_blocks_on_rt_mutex() function is called. This will add the task to +the lock's waiter tree and propagate the pi chain of the lock as well +as the lock's owner's pi_waiters tree. This is described in the next +section. + +Task blocks on mutex +-------------------- + +The accounting of a mutex and process is done with the waiter structure of +the process. The "task" field is set to the process, and the "lock" field +to the mutex. The rbtree node of waiter are initialized to the processes +current priority. + +Since the wait_lock was taken at the entry of the slow lock, we can safely +add the waiter to the task waiter tree. If the current process is the +highest priority process currently waiting on this mutex, then we remove the +previous top waiter process (if it exists) from the pi_waiters of the owner, +and add the current process to that tree. Since the pi_waiter of the owner +has changed, we call rt_mutex_adjust_prio on the owner to see if the owner +should adjust its priority accordingly. + +If the owner is also blocked on a lock, and had its pi_waiters changed +(or deadlock checking is on), we unlock the wait_lock of the mutex and go ahead +and run rt_mutex_adjust_prio_chain on the owner, as described earlier. + +Now all locks are released, and if the current process is still blocked on a +mutex (waiter "task" field is not NULL), then we go to sleep (call schedule). 
+ +Waking up in the loop +--------------------- + +The task can then wake up for a couple of reasons: + 1) The previous lock owner released the lock, and the task now is top_waiter + 2) we received a signal or timeout + +In both cases, the task will try again to acquire the lock. If it +does, then it will take itself off the waiters tree and set itself back +to the TASK_RUNNING state. + +In first case, if the lock was acquired by another task before this task +could get the lock, then it will go back to sleep and wait to be woken again. + +The second case is only applicable for tasks that are grabbing a mutex +that can wake up before getting the lock, either due to a signal or +a timeout (i.e. rt_mutex_timed_futex_lock()). When woken, it will try to +take the lock again, if it succeeds, then the task will return with the +lock held, otherwise it will return with -EINTR if the task was woken +by a signal, or -ETIMEDOUT if it timed out. + + +Unlocking the Mutex +------------------- + +The unlocking of a mutex also has a fast path for those architectures with +CMPXCHG. Since the taking of a mutex on contention always sets the +"Has Waiters" flag of the mutex's owner, we use this to know if we need to +take the slow path when unlocking the mutex. If the mutex doesn't have any +waiters, the owner field of the mutex would equal the current process and +the mutex can be unlocked by just replacing the owner field with NULL. + +If the owner field has the "Has Waiters" bit set (or CMPXCHG is not available), +the slow unlock path is taken. + +The first thing done in the slow unlock path is to take the wait_lock of the +mutex. This synchronizes the locking and unlocking of the mutex. + +A check is made to see if the mutex has waiters or not. On architectures that +do not have CMPXCHG, this is the location that the owner of the mutex will +determine if a waiter needs to be awoken or not. 
On architectures that +do have CMPXCHG, that check is done in the fast path, but it is still needed +in the slow path too. If a waiter of a mutex woke up because of a signal +or timeout between the time the owner failed the fast path CMPXCHG check and +the grabbing of the wait_lock, the mutex may not have any waiters, thus the +owner still needs to make this check. If there are no waiters then the mutex +owner field is set to NULL, the wait_lock is released and nothing more is +needed. + +If there are waiters, then we need to wake one up. + +On the wake up code, the pi_lock of the current owner is taken. The top +waiter of the lock is found and removed from the waiters tree of the mutex +as well as the pi_waiters tree of the current owner. The "Has Waiters" bit is +marked to prevent lower priority tasks from stealing the lock. + +Finally we unlock the pi_lock of the pending owner and wake it up. + + +Contact +------- + +For updates on this document, please email Steven Rostedt + + +Credits +------- + +Author: Steven Rostedt + +Updated: Alex Shi - 7/6/2017 + +Original Reviewers: + Ingo Molnar, Thomas Gleixner, Thomas Duetsch, and + Randy Dunlap + +Update (7/6/2017) Reviewers: Steven Rostedt and Sebastian Siewior + +Updates +------- + +This document was originally written for 2.6.17-rc3-mm1 +was updated on 4.12 diff --git a/Documentation/locking/rt-mutex-design.txt b/Documentation/locking/rt-mutex-design.txt deleted file mode 100644 index 3d7b865539cc..000000000000 --- a/Documentation/locking/rt-mutex-design.txt +++ /dev/null @@ -1,559 +0,0 @@ -# -# Copyright (c) 2006 Steven Rostedt -# Licensed under the GNU Free Documentation License, Version 1.2 -# - -RT-mutex implementation design ------------------------------- - -This document tries to describe the design of the rtmutex.c implementation. -It doesn't describe the reasons why rtmutex.c exists. For that please see -Documentation/locking/rt-mutex.txt. 
Although this document does explain problems -that happen without this code, but that is in the concept to understand -what the code actually is doing. - -The goal of this document is to help others understand the priority -inheritance (PI) algorithm that is used, as well as reasons for the -decisions that were made to implement PI in the manner that was done. - - -Unbounded Priority Inversion ----------------------------- - -Priority inversion is when a lower priority process executes while a higher -priority process wants to run. This happens for several reasons, and -most of the time it can't be helped. Anytime a high priority process wants -to use a resource that a lower priority process has (a mutex for example), -the high priority process must wait until the lower priority process is done -with the resource. This is a priority inversion. What we want to prevent -is something called unbounded priority inversion. That is when the high -priority process is prevented from running by a lower priority process for -an undetermined amount of time. - -The classic example of unbounded priority inversion is where you have three -processes, let's call them processes A, B, and C, where A is the highest -priority process, C is the lowest, and B is in between. A tries to grab a lock -that C owns and must wait and lets C run to release the lock. But in the -meantime, B executes, and since B is of a higher priority than C, it preempts C, -but by doing so, it is in fact preempting A which is a higher priority process. -Now there's no way of knowing how long A will be sleeping waiting for C -to release the lock, because for all we know, B is a CPU hog and will -never give C a chance to release the lock. This is called unbounded priority -inversion. - -Here's a little ASCII art to show the problem. - - grab lock L1 (owned by C) - | -A ---+ - C preempted by B - | -C +----+ - -B +--------> - B now keeps A from running. 
- - -Priority Inheritance (PI) -------------------------- - -There are several ways to solve this issue, but other ways are out of scope -for this document. Here we only discuss PI. - -PI is where a process inherits the priority of another process if the other -process blocks on a lock owned by the current process. To make this easier -to understand, let's use the previous example, with processes A, B, and C again. - -This time, when A blocks on the lock owned by C, C would inherit the priority -of A. So now if B becomes runnable, it would not preempt C, since C now has -the high priority of A. As soon as C releases the lock, it loses its -inherited priority, and A then can continue with the resource that C had. - -Terminology ------------ - -Here I explain some terminology that is used in this document to help describe -the design that is used to implement PI. - -PI chain - The PI chain is an ordered series of locks and processes that cause - processes to inherit priorities from a previous process that is - blocked on one of its locks. This is described in more detail - later in this document. - -mutex - In this document, to differentiate from locks that implement - PI and spin locks that are used in the PI code, from now on - the PI locks will be called a mutex. - -lock - In this document from now on, I will use the term lock when - referring to spin locks that are used to protect parts of the PI - algorithm. These locks disable preemption for UP (when - CONFIG_PREEMPT is enabled) and on SMP prevents multiple CPUs from - entering critical sections simultaneously. - -spin lock - Same as lock above. - -waiter - A waiter is a struct that is stored on the stack of a blocked - process. Since the scope of the waiter is within the code for - a process being blocked on the mutex, it is fine to allocate - the waiter on the process's stack (local variable). This - structure holds a pointer to the task, as well as the mutex that - the task is blocked on. 
It also has rbtree node structures to - place the task in the waiters rbtree of a mutex as well as the - pi_waiters rbtree of a mutex owner task (described below). - - waiter is sometimes used in reference to the task that is waiting - on a mutex. This is the same as waiter->task. - -waiters - A list of processes that are blocked on a mutex. - -top waiter - The highest priority process waiting on a specific mutex. - -top pi waiter - The highest priority process waiting on one of the mutexes - that a specific process owns. - -Note: task and process are used interchangeably in this document, mostly to - differentiate between two processes that are being described together. - - -PI chain --------- - -The PI chain is a list of processes and mutexes that may cause priority -inheritance to take place. Multiple chains may converge, but a chain -would never diverge, since a process can't be blocked on more than one -mutex at a time. - -Example: - - Process: A, B, C, D, E - Mutexes: L1, L2, L3, L4 - - A owns: L1 - B blocked on L1 - B owns L2 - C blocked on L2 - C owns L3 - D blocked on L3 - D owns L4 - E blocked on L4 - -The chain would be: - - E->L4->D->L3->C->L2->B->L1->A - -To show where two chains merge, we could add another process F and -another mutex L5 where B owns L5 and F is blocked on mutex L5. - -The chain for F would be: - - F->L5->B->L1->A - -Since a process may own more than one mutex, but never be blocked on more than -one, the chains merge. - -Here we show both chains: - - E->L4->D->L3->C->L2-+ - | - +->B->L1->A - | - F->L5-+ - -For PI to work, the processes at the right end of these chains (or we may -also call it the Top of the chain) must be equal to or higher in priority -than the processes to the left or below in the chain. - -Also since a mutex may have more than one process blocked on it, we can -have multiple chains merge at mutexes. 
If we add another process G that is -blocked on mutex L2: - - G->L2->B->L1->A - -And once again, to show how this can grow I will show the merging chains -again. - - E->L4->D->L3->C-+ - +->L2-+ - | | - G-+ +->B->L1->A - | - F->L5-+ - -If process G has the highest priority in the chain, then all the tasks up -the chain (A and B in this example), must have their priorities increased -to that of G. - -Mutex Waiters Tree ------------------ - -Every mutex keeps track of all the waiters that are blocked on itself. The -mutex has a rbtree to store these waiters by priority. This tree is protected -by a spin lock that is located in the struct of the mutex. This lock is called -wait_lock. - - -Task PI Tree ------------- - -To keep track of the PI chains, each process has its own PI rbtree. This is -a tree of all top waiters of the mutexes that are owned by the process. -Note that this tree only holds the top waiters and not all waiters that are -blocked on mutexes owned by the process. - -The top of the task's PI tree is always the highest priority task that -is waiting on a mutex that is owned by the task. So if the task has -inherited a priority, it will always be the priority of the task that is -at the top of this tree. - -This tree is stored in the task structure of a process as a rbtree called -pi_waiters. It is protected by a spin lock also in the task structure, -called pi_lock. This lock may also be taken in interrupt context, so when -locking the pi_lock, interrupts must be disabled. - - -Depth of the PI Chain ---------------------- - -The maximum depth of the PI chain is not dynamic, and could actually be -defined. But is very complex to figure it out, since it depends on all -the nesting of mutexes. Let's look at the example where we have 3 mutexes, -L1, L2, and L3, and four separate functions func1, func2, func3 and func4. -The following shows a locking order of L1->L2->L3, but may not actually -be directly nested that way. 
- -void func1(void) -{ - mutex_lock(L1); - - /* do anything */ - - mutex_unlock(L1); -} - -void func2(void) -{ - mutex_lock(L1); - mutex_lock(L2); - - /* do something */ - - mutex_unlock(L2); - mutex_unlock(L1); -} - -void func3(void) -{ - mutex_lock(L2); - mutex_lock(L3); - - /* do something else */ - - mutex_unlock(L3); - mutex_unlock(L2); -} - -void func4(void) -{ - mutex_lock(L3); - - /* do something again */ - - mutex_unlock(L3); -} - -Now we add 4 processes that run each of these functions separately. -Processes A, B, C, and D which run functions func1, func2, func3 and func4 -respectively, and such that D runs first and A last. With D being preempted -in func4 in the "do something again" area, we have a locking that follows: - -D owns L3 - C blocked on L3 - C owns L2 - B blocked on L2 - B owns L1 - A blocked on L1 - -And thus we have the chain A->L1->B->L2->C->L3->D. - -This gives us a PI depth of 4 (four processes), but looking at any of the -functions individually, it seems as though they only have at most a locking -depth of two. So, although the locking depth is defined at compile time, -it still is very difficult to find the possibilities of that depth. - -Now since mutexes can be defined by user-land applications, we don't want a DOS -type of application that nests large amounts of mutexes to create a large -PI chain, and have the code holding spin locks while looking at a large -amount of data. So to prevent this, the implementation not only implements -a maximum lock depth, but also only holds at most two different locks at a -time, as it walks the PI chain. More about this below. - - -Mutex owner and flags ---------------------- - -The mutex structure contains a pointer to the owner of the mutex. If the -mutex is not owned, this owner is set to NULL. 
Since all architectures -have the task structure on at least a two byte alignment (and if this is -not true, the rtmutex.c code will be broken!), this allows for the least -significant bit to be used as a flag. Bit 0 is used as the "Has Waiters" -flag. It's set whenever there are waiters on a mutex. - -See Documentation/locking/rt-mutex.txt for further details. - -cmpxchg Tricks --------------- - -Some architectures implement an atomic cmpxchg (Compare and Exchange). This -is used (when applicable) to keep the fast path of grabbing and releasing -mutexes short. - -cmpxchg is basically the following function performed atomically: - -unsigned long _cmpxchg(unsigned long *A, unsigned long *B, unsigned long *C) -{ - unsigned long T = *A; - if (*A == *B) { - *A = *C; - } - return T; -} -#define cmpxchg(a,b,c) _cmpxchg(&a,&b,&c) - -This is really nice to have, since it allows you to only update a variable -if the variable is what you expect it to be. You know if it succeeded if -the return value (the old value of A) is equal to B. - -The macro rt_mutex_cmpxchg is used to try to lock and unlock mutexes. If -the architecture does not support CMPXCHG, then this macro is simply set -to fail every time. But if CMPXCHG is supported, then this will -help out extremely to keep the fast path short. - -The use of rt_mutex_cmpxchg with the flags in the owner field help optimize -the system for architectures that support it. This will also be explained -later in this document. - - -Priority adjustments --------------------- - -The implementation of the PI code in rtmutex.c has several places that a -process must adjust its priority. With the help of the pi_waiters of a -process this is rather easy to know what needs to be adjusted. - -The functions implementing the task adjustments are rt_mutex_adjust_prio -and rt_mutex_setprio. rt_mutex_setprio is only used in rt_mutex_adjust_prio. 
- -rt_mutex_adjust_prio examines the priority of the task, and the highest -priority process that is waiting any of mutexes owned by the task. Since -the pi_waiters of a task holds an order by priority of all the top waiters -of all the mutexes that the task owns, we simply need to compare the top -pi waiter to its own normal/deadline priority and take the higher one. -Then rt_mutex_setprio is called to adjust the priority of the task to the -new priority. Note that rt_mutex_setprio is defined in kernel/sched/core.c -to implement the actual change in priority. - -(Note: For the "prio" field in task_struct, the lower the number, the - higher the priority. A "prio" of 5 is of higher priority than a - "prio" of 10.) - -It is interesting to note that rt_mutex_adjust_prio can either increase -or decrease the priority of the task. In the case that a higher priority -process has just blocked on a mutex owned by the task, rt_mutex_adjust_prio -would increase/boost the task's priority. But if a higher priority task -were for some reason to leave the mutex (timeout or signal), this same function -would decrease/unboost the priority of the task. That is because the pi_waiters -always contains the highest priority task that is waiting on a mutex owned -by the task, so we only need to compare the priority of that top pi waiter -to the normal priority of the given task. - - -High level overview of the PI chain walk ----------------------------------------- - -The PI chain walk is implemented by the function rt_mutex_adjust_prio_chain. - -The implementation has gone through several iterations, and has ended up -with what we believe is the best. It walks the PI chain by only grabbing -at most two locks at a time, and is very efficient. - -The rt_mutex_adjust_prio_chain can be used either to boost or lower process -priorities. 
- -rt_mutex_adjust_prio_chain is called with a task to be checked for PI -(de)boosting (the owner of a mutex that a process is blocking on), a flag to -check for deadlocking, the mutex that the task owns, a pointer to a waiter -that is the process's waiter struct that is blocked on the mutex (although this -parameter may be NULL for deboosting), a pointer to the mutex on which the task -is blocked, and a top_task as the top waiter of the mutex. - -For this explanation, I will not mention deadlock detection. This explanation -will try to stay at a high level. - -When this function is called, there are no locks held. That also means -that the state of the owner and lock can change when entered into this function. - -Before this function is called, the task has already had rt_mutex_adjust_prio -performed on it. This means that the task is set to the priority that it -should be at, but the rbtree nodes of the task's waiter have not been updated -with the new priorities, and this task may not be in the proper locations -in the pi_waiters and waiters trees that the task is blocked on. This function -solves all that. - -The main operation of this function is summarized by Thomas Gleixner in -rtmutex.c. See the 'Chain walk basics and protection scope' comment for further -details. - -Taking of a mutex (The walk through) ------------------------------------- - -OK, now let's take a look at the detailed walk through of what happens when -taking a mutex. - -The first thing that is tried is the fast taking of the mutex. This is -done when we have CMPXCHG enabled (otherwise the fast taking automatically -fails). Only when the owner field of the mutex is NULL can the lock be -taken with the CMPXCHG and nothing else needs to be done. - -If there is contention on the lock, we go about the slow path -(rt_mutex_slowlock). - -The slow path function is where the task's waiter structure is created on -the stack. 
This is because the waiter structure is only needed for the -scope of this function. The waiter structure holds the nodes to store -the task on the waiters tree of the mutex, and if need be, the pi_waiters -tree of the owner. - -The wait_lock of the mutex is taken since the slow path of unlocking the -mutex also takes this lock. - -We then call try_to_take_rt_mutex. This is where the architecture that -does not implement CMPXCHG would always grab the lock (if there's no -contention). - -try_to_take_rt_mutex is used every time the task tries to grab a mutex in the -slow path. The first thing that is done here is an atomic setting of -the "Has Waiters" flag of the mutex's owner field. By setting this flag -now, the current owner of the mutex being contended for can't release the mutex -without going into the slow unlock path, and it would then need to grab the -wait_lock, which this code currently holds. So setting the "Has Waiters" flag -forces the current owner to synchronize with this code. - -The lock is taken if the following are true: - 1) The lock has no owner - 2) The current task is the highest priority against all other - waiters of the lock - -If the task succeeds to acquire the lock, then the task is set as the -owner of the lock, and if the lock still has waiters, the top_waiter -(highest priority task waiting on the lock) is added to this task's -pi_waiters tree. - -If the lock is not taken by try_to_take_rt_mutex(), then the -task_blocks_on_rt_mutex() function is called. This will add the task to -the lock's waiter tree and propagate the pi chain of the lock as well -as the lock's owner's pi_waiters tree. This is described in the next -section. - -Task blocks on mutex --------------------- - -The accounting of a mutex and process is done with the waiter structure of -the process. The "task" field is set to the process, and the "lock" field -to the mutex. The rbtree node of waiter are initialized to the processes -current priority. 
- -Since the wait_lock was taken at the entry of the slow lock, we can safely -add the waiter to the task waiter tree. If the current process is the -highest priority process currently waiting on this mutex, then we remove the -previous top waiter process (if it exists) from the pi_waiters of the owner, -and add the current process to that tree. Since the pi_waiter of the owner -has changed, we call rt_mutex_adjust_prio on the owner to see if the owner -should adjust its priority accordingly. - -If the owner is also blocked on a lock, and had its pi_waiters changed -(or deadlock checking is on), we unlock the wait_lock of the mutex and go ahead -and run rt_mutex_adjust_prio_chain on the owner, as described earlier. - -Now all locks are released, and if the current process is still blocked on a -mutex (waiter "task" field is not NULL), then we go to sleep (call schedule). - -Waking up in the loop ---------------------- - -The task can then wake up for a couple of reasons: - 1) The previous lock owner released the lock, and the task now is top_waiter - 2) we received a signal or timeout - -In both cases, the task will try again to acquire the lock. If it -does, then it will take itself off the waiters tree and set itself back -to the TASK_RUNNING state. - -In first case, if the lock was acquired by another task before this task -could get the lock, then it will go back to sleep and wait to be woken again. - -The second case is only applicable for tasks that are grabbing a mutex -that can wake up before getting the lock, either due to a signal or -a timeout (i.e. rt_mutex_timed_futex_lock()). When woken, it will try to -take the lock again, if it succeeds, then the task will return with the -lock held, otherwise it will return with -EINTR if the task was woken -by a signal, or -ETIMEDOUT if it timed out. - - -Unlocking the Mutex -------------------- - -The unlocking of a mutex also has a fast path for those architectures with -CMPXCHG. 
Since the taking of a mutex on contention always sets the -"Has Waiters" flag of the mutex's owner, we use this to know if we need to -take the slow path when unlocking the mutex. If the mutex doesn't have any -waiters, the owner field of the mutex would equal the current process and -the mutex can be unlocked by just replacing the owner field with NULL. - -If the owner field has the "Has Waiters" bit set (or CMPXCHG is not available), -the slow unlock path is taken. - -The first thing done in the slow unlock path is to take the wait_lock of the -mutex. This synchronizes the locking and unlocking of the mutex. - -A check is made to see if the mutex has waiters or not. On architectures that -do not have CMPXCHG, this is the location that the owner of the mutex will -determine if a waiter needs to be awoken or not. On architectures that -do have CMPXCHG, that check is done in the fast path, but it is still needed -in the slow path too. If a waiter of a mutex woke up because of a signal -or timeout between the time the owner failed the fast path CMPXCHG check and -the grabbing of the wait_lock, the mutex may not have any waiters, thus the -owner still needs to make this check. If there are no waiters then the mutex -owner field is set to NULL, the wait_lock is released and nothing more is -needed. - -If there are waiters, then we need to wake one up. - -On the wake up code, the pi_lock of the current owner is taken. The top -waiter of the lock is found and removed from the waiters tree of the mutex -as well as the pi_waiters tree of the current owner. The "Has Waiters" bit is -marked to prevent lower priority tasks from stealing the lock. - -Finally we unlock the pi_lock of the pending owner and wake it up. 
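Illustration (not part of the patch): the fast lock and unlock paths described above can be sketched in userspace with C11 atomics. This is a minimal sketch under stated assumptions, not the rtmutex.c code. The names `toy_task`, `toy_rtmutex`, `toy_fast_lock`, `toy_fast_unlock` and `toy_mark_has_waiters` are hypothetical, and `atomic_compare_exchange_strong` stands in for the kernel's cmpxchg.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define TOY_HAS_WAITERS ((uintptr_t)1)   /* bit 0 of the owner word */

struct toy_task { int id; };             /* hypothetical stand-in for task_struct */
struct toy_rtmutex { _Atomic uintptr_t owner; };  /* 0 plays the role of NULL */

static void toy_init(struct toy_rtmutex *m)
{
    atomic_init(&m->owner, (uintptr_t)0);
}

/* Fast lock: succeeds only when the owner word is exactly NULL
 * (no owner and no "Has Waiters" bit). */
static int toy_fast_lock(struct toy_rtmutex *m, struct toy_task *t)
{
    uintptr_t expected = 0;

    /* Bit 0 is only usable as a flag if task pointers are at least 2-byte aligned. */
    assert(((uintptr_t)t & TOY_HAS_WAITERS) == 0);
    return atomic_compare_exchange_strong(&m->owner, &expected, (uintptr_t)t);
}

/* Fast unlock: succeeds only when the owner word equals the bare task
 * pointer. If a contending task has set bit 0, the compare fails and the
 * caller would have to fall back to the slow unlock path. */
static int toy_fast_unlock(struct toy_rtmutex *m, struct toy_task *t)
{
    uintptr_t expected = (uintptr_t)t;

    return atomic_compare_exchange_strong(&m->owner, &expected, (uintptr_t)0);
}

/* What a contending task does first in the slow lock path: set the flag so
 * the owner can no longer release the mutex through the fast path. */
static void toy_mark_has_waiters(struct toy_rtmutex *m)
{
    atomic_fetch_or(&m->owner, TOY_HAS_WAITERS);
}
```

In this sketch an uncontended lock and unlock each cost a single compare-and-exchange. Once `toy_mark_has_waiters()` has run, `toy_fast_unlock()` returns 0, which mirrors how the "Has Waiters" bit forces the owner to synchronize on the wait_lock.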
- - -Contact -------- - -For updates on this document, please email Steven Rostedt - - -Credits -------- - -Author: Steven Rostedt -Updated: Alex Shi - 7/6/2017 - -Original Reviewers: Ingo Molnar, Thomas Gleixner, Thomas Duetsch, and - Randy Dunlap -Update (7/6/2017) Reviewers: Steven Rostedt and Sebastian Siewior - -Updates -------- - -This document was originally written for 2.6.17-rc3-mm1 -was updated on 4.12 diff --git a/Documentation/locking/rt-mutex.rst b/Documentation/locking/rt-mutex.rst new file mode 100644 index 000000000000..c365dc302081 --- /dev/null +++ b/Documentation/locking/rt-mutex.rst @@ -0,0 +1,77 @@ +================================== +RT-mutex subsystem with PI support +================================== + +RT-mutexes with priority inheritance are used to support PI-futexes, +which enable pthread_mutex_t priority inheritance attributes +(PTHREAD_PRIO_INHERIT). [See Documentation/pi-futex.txt for more details +about PI-futexes.] + +This technology was developed in the -rt tree and streamlined for +pthread_mutex support. + +Basic principles: +----------------- + +RT-mutexes extend the semantics of simple mutexes by the priority +inheritance protocol. + +A low priority owner of a rt-mutex inherits the priority of a higher +priority waiter until the rt-mutex is released. If the temporarily +boosted owner blocks on a rt-mutex itself it propagates the priority +boosting to the owner of the other rt_mutex it gets blocked on. The +priority boosting is immediately removed once the rt_mutex has been +unlocked. + +This approach allows us to shorten the block of high-prio tasks on +mutexes which protect shared resources. Priority inheritance is not a +magic bullet for poorly designed applications, but it allows +well-designed applications to use userspace locks in critical parts of +an high priority thread, without losing determinism. + +The enqueueing of the waiters into the rtmutex waiter tree is done in +priority order. 
For same priorities FIFO order is chosen. For each +rtmutex, only the top priority waiter is enqueued into the owner's +priority waiters tree. This tree too queues in priority order. Whenever +the top priority waiter of a task changes (for example it timed out or +got a signal), the priority of the owner task is readjusted. The +priority enqueueing is handled by "pi_waiters". + +RT-mutexes are optimized for fastpath operations and have no internal +locking overhead when locking an uncontended mutex or unlocking a mutex +without waiters. The optimized fastpath operations require cmpxchg +support. [If that is not available then the rt-mutex internal spinlock +is used] + +The state of the rt-mutex is tracked via the owner field of the rt-mutex +structure: + +lock->owner holds the task_struct pointer of the owner. Bit 0 is used to +keep track of the "lock has waiters" state: + + ============ ======= ================================================ + owner bit0 Notes + ============ ======= ================================================ + NULL 0 lock is free (fast acquire possible) + NULL 1 lock is free and has waiters and the top waiter + is going to take the lock [1]_ + taskpointer 0 lock is held (fast release possible) + taskpointer 1 lock is held and has waiters [2]_ + ============ ======= ================================================ + +The fast atomic compare exchange based acquire and release is only +possible when bit 0 of lock->owner is 0. + +.. [1] It also can be a transitional state when grabbing the lock + with ->wait_lock is held. To prevent any fast path cmpxchg to the lock, + we need to set the bit0 before looking at the lock, and the owner may + be NULL in this small time, hence this can be a transitional state. + +.. [2] There is a small time when bit 0 is set but there are no + waiters. This can happen when grabbing the lock in the slow path. 
+ To prevent a cmpxchg of the owner releasing the lock, we need to + set this bit before looking at the lock. + +BTW, there is still technically a "Pending Owner", it's just not called +that anymore. The pending owner happens to be the top_waiter of a lock +that has no owner and has been woken up to grab the lock. diff --git a/Documentation/locking/rt-mutex.txt b/Documentation/locking/rt-mutex.txt deleted file mode 100644 index 35793e003041..000000000000 --- a/Documentation/locking/rt-mutex.txt +++ /dev/null @@ -1,73 +0,0 @@ -RT-mutex subsystem with PI support ----------------------------------- - -RT-mutexes with priority inheritance are used to support PI-futexes, -which enable pthread_mutex_t priority inheritance attributes -(PTHREAD_PRIO_INHERIT). [See Documentation/pi-futex.txt for more details -about PI-futexes.] - -This technology was developed in the -rt tree and streamlined for -pthread_mutex support. - -Basic principles: ------------------ - -RT-mutexes extend the semantics of simple mutexes by the priority -inheritance protocol. - -A low priority owner of a rt-mutex inherits the priority of a higher -priority waiter until the rt-mutex is released. If the temporarily -boosted owner blocks on a rt-mutex itself it propagates the priority -boosting to the owner of the other rt_mutex it gets blocked on. The -priority boosting is immediately removed once the rt_mutex has been -unlocked. - -This approach allows us to shorten the block of high-prio tasks on -mutexes which protect shared resources. Priority inheritance is not a -magic bullet for poorly designed applications, but it allows -well-designed applications to use userspace locks in critical parts of -an high priority thread, without losing determinism. - -The enqueueing of the waiters into the rtmutex waiter tree is done in -priority order. For same priorities FIFO order is chosen. For each -rtmutex, only the top priority waiter is enqueued into the owner's -priority waiters tree. 
This tree too queues in priority order. Whenever -the top priority waiter of a task changes (for example it timed out or -got a signal), the priority of the owner task is readjusted. The -priority enqueueing is handled by "pi_waiters". - -RT-mutexes are optimized for fastpath operations and have no internal -locking overhead when locking an uncontended mutex or unlocking a mutex -without waiters. The optimized fastpath operations require cmpxchg -support. [If that is not available then the rt-mutex internal spinlock -is used] - -The state of the rt-mutex is tracked via the owner field of the rt-mutex -structure: - -lock->owner holds the task_struct pointer of the owner. Bit 0 is used to -keep track of the "lock has waiters" state. - - owner bit0 - NULL 0 lock is free (fast acquire possible) - NULL 1 lock is free and has waiters and the top waiter - is going to take the lock* - taskpointer 0 lock is held (fast release possible) - taskpointer 1 lock is held and has waiters** - -The fast atomic compare exchange based acquire and release is only -possible when bit 0 of lock->owner is 0. - -(*) It also can be a transitional state when grabbing the lock -with ->wait_lock is held. To prevent any fast path cmpxchg to the lock, -we need to set the bit0 before looking at the lock, and the owner may be -NULL in this small time, hence this can be a transitional state. - -(**) There is a small time when bit 0 is set but there are no -waiters. This can happen when grabbing the lock in the slow path. -To prevent a cmpxchg of the owner releasing the lock, we need to -set this bit before looking at the lock. - -BTW, there is still technically a "Pending Owner", it's just not called -that anymore. The pending owner happens to be the top_waiter of a lock -that has no owner and has been woken up to grab the lock. 
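Illustration (not part of the patch): the four owner/bit0 combinations in the table above can be decoded with a few lines of C. This is a sketch under assumptions. The enum and the helpers `owner_word_make` and `owner_word_classify` are hypothetical names and do not exist in the kernel; the owner field is modelled as a plain `uintptr_t`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OWNER_HAS_WAITERS ((uintptr_t)1)   /* bit 0 of lock->owner */

enum owner_state {
    OWNER_FREE,              /* NULL, bit0 = 0: fast acquire possible */
    OWNER_FREE_TOP_PENDING,  /* NULL, bit0 = 1: top waiter is about to take it
                              * (or a transitional state) */
    OWNER_HELD,              /* task, bit0 = 0: fast release possible */
    OWNER_HELD_WAITERS       /* task, bit0 = 1: held and has waiters */
};

/* Pack a task pointer and the "has waiters" flag into one word. */
static uintptr_t owner_word_make(const void *task, int has_waiters)
{
    uintptr_t w = (uintptr_t)task;

    /* The encoding relies on task pointers never having bit 0 set. */
    assert((w & OWNER_HAS_WAITERS) == 0);
    return has_waiters ? (w | OWNER_HAS_WAITERS) : w;
}

/* Map an owner word onto one of the four rows of the table. */
static enum owner_state owner_word_classify(uintptr_t w)
{
    int has_waiters = (w & OWNER_HAS_WAITERS) != 0;
    uintptr_t task = w & ~OWNER_HAS_WAITERS;

    if (task == 0)
        return has_waiters ? OWNER_FREE_TOP_PENDING : OWNER_FREE;
    return has_waiters ? OWNER_HELD_WAITERS : OWNER_HELD;
}
```

Only `OWNER_FREE` and `OWNER_HELD` have bit 0 clear. These are the two states in which the text says the compare-exchange based fast acquire and fast release are possible.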
diff --git a/Documentation/locking/spinlocks.rst b/Documentation/locking/spinlocks.rst new file mode 100644 index 000000000000..098107fb7d86 --- /dev/null +++ b/Documentation/locking/spinlocks.rst @@ -0,0 +1,177 @@ +=============== +Locking lessons +=============== + +Lesson 1: Spin locks +==================== + +The most basic primitive for locking is spinlock:: + + static DEFINE_SPINLOCK(xxx_lock); + + unsigned long flags; + + spin_lock_irqsave(&xxx_lock, flags); + ... critical section here .. + spin_unlock_irqrestore(&xxx_lock, flags); + +The above is always safe. It will disable interrupts _locally_, but the +spinlock itself will guarantee the global lock, so it will guarantee that +there is only one thread-of-control within the region(s) protected by that +lock. This works well even under UP also, so the code does _not_ need to +worry about UP vs SMP issues: the spinlocks work correctly under both. + + NOTE! Implications of spin_locks for memory are further described in: + + Documentation/memory-barriers.txt + + (5) LOCK operations. + + (6) UNLOCK operations. + +The above is usually pretty simple (you usually need and want only one +spinlock for most things - using more than one spinlock can make things a +lot more complex and even slower and is usually worth it only for +sequences that you **know** need to be split up: avoid it at all cost if you +aren't sure). + +This is really the only really hard part about spinlocks: once you start +using spinlocks they tend to expand to areas you might not have noticed +before, because you have to make sure the spinlocks correctly protect the +shared data structures **everywhere** they are used. The spinlocks are most +easily added to places that are completely independent of other code (for +example, internal driver data structures that nobody else ever touches). + + NOTE! 
The spin-lock is safe only when you **also** use the lock itself + to do locking across CPU's, which implies that EVERYTHING that + touches a shared variable has to agree about the spinlock they want + to use. + +---- + +Lesson 2: reader-writer spinlocks. +================================== + +If your data accesses have a very natural pattern where you usually tend +to mostly read from the shared variables, the reader-writer locks +(rw_lock) versions of the spinlocks are sometimes useful. They allow multiple +readers to be in the same critical region at once, but if somebody wants +to change the variables it has to get an exclusive write lock. + + NOTE! reader-writer locks require more atomic memory operations than + simple spinlocks. Unless the reader critical section is long, you + are better off just using spinlocks. + +The routines look the same as above:: + + rwlock_t xxx_lock = __RW_LOCK_UNLOCKED(xxx_lock); + + unsigned long flags; + + read_lock_irqsave(&xxx_lock, flags); + .. critical section that only reads the info ... + read_unlock_irqrestore(&xxx_lock, flags); + + write_lock_irqsave(&xxx_lock, flags); + .. read and write exclusive access to the info ... + write_unlock_irqrestore(&xxx_lock, flags); + +The above kind of lock may be useful for complex data structures like +linked lists, especially searching for entries without changing the list +itself. The read lock allows many concurrent readers. Anything that +**changes** the list will have to get the write lock. + + NOTE! RCU is better for list traversal, but requires careful + attention to design detail (see Documentation/RCU/listRCU.txt). + +Also, you cannot "upgrade" a read-lock to a write-lock, so if you at _any_ +time need to do any changes (even if you don't do it every time), you have +to get the write-lock at the very beginning. + + NOTE! We are working hard to remove reader-writer spinlocks in most + cases, so please don't add a new one without consensus. 
(Instead, see + Documentation/RCU/rcu.txt for complete information.) + +---- + +Lesson 3: spinlocks revisited. +============================== + +The single spin-lock primitives above are by no means the only ones. They +are the most safe ones, and the ones that work under all circumstances, +but partly **because** they are safe they are also fairly slow. They are slower +than they'd need to be, because they do have to disable interrupts +(which is just a single instruction on a x86, but it's an expensive one - +and on other architectures it can be worse). + +If you have a case where you have to protect a data structure across +several CPU's and you want to use spinlocks you can potentially use +cheaper versions of the spinlocks. IFF you know that the spinlocks are +never used in interrupt handlers, you can use the non-irq versions:: + + spin_lock(&lock); + ... + spin_unlock(&lock); + +(and the equivalent read-write versions too, of course). The spinlock will +guarantee the same kind of exclusive access, and it will be much faster. +This is useful if you know that the data in question is only ever +manipulated from a "process context", ie no interrupts involved. + +The reasons you mustn't use these versions if you have interrupts that +play with the spinlock is that you can get deadlocks:: + + spin_lock(&lock); + ... + <- interrupt comes in: + spin_lock(&lock); + +where an interrupt tries to lock an already locked variable. This is ok if +the other interrupt happens on another CPU, but it is _not_ ok if the +interrupt happens on the same CPU that already holds the lock, because the +lock will obviously never be released (because the interrupt is waiting +for the lock, and the lock-holder is interrupted by the interrupt and will +not continue until the interrupt has been processed). 
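Editor's sketch (not part of the patch): the same-CPU deadlock described above can be imitated in user space. The sketch assumes a toy lock built on a C11 `atomic_flag` and models the "interrupt handler" as an ordinary function called while the lock is held. It uses a trylock so the failure is visible as a return value; a real spin_lock() in the handler would spin forever instead. All `toy_*` names and `handler_succeeds_while_lock_held()` are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical toy lock: a single flag, set while the lock is held. */
static atomic_flag toy_lock = ATOMIC_FLAG_INIT;

/* Returns true if the lock was free and is now held by the caller. */
static bool toy_trylock(void)
{
	return !atomic_flag_test_and_set_explicit(&toy_lock,
						  memory_order_acquire);
}

static void toy_unlock(void)
{
	atomic_flag_clear_explicit(&toy_lock, memory_order_release);
}

/*
 * Stand-in for an interrupt handler that needs the same lock.
 * Returns false where a real spinning handler would hang.
 */
static bool toy_irq_handler(void)
{
	if (!toy_trylock())
		return false;
	/* ... handler work on the shared data would go here ... */
	toy_unlock();
	return true;
}

/*
 * "Process context" takes the lock, then the "interrupt" arrives on the
 * same CPU before the unlock. Returns whether the handler got the lock.
 */
static bool handler_succeeds_while_lock_held(void)
{
	bool got = toy_trylock();
	bool handler_ok;

	assert(got);			/* lock is expected to be free here */
	(void)got;
	handler_ok = toy_irq_handler();	/* interrupt comes in */
	toy_unlock();
	return handler_ok;
}
```

The handler cannot make progress while the interrupted code still holds the lock, and the interrupted code cannot release it until the handler returns. The irq-safe variants avoid this by keeping local interrupts disabled for as long as the lock is held.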
+
+(This is also the reason why the irq-versions of the spinlocks only need
+to disable the _local_ interrupts - it's ok to use spinlocks in interrupts
+on other CPU's, because an interrupt on another CPU doesn't interrupt the
+CPU that holds the lock, so the lock-holder can continue and eventually
+releases the lock).
+
+Note that you can be clever with read-write locks and interrupts. For
+example, if you know that the interrupt only ever gets a read-lock, then
+you can use a non-irq version of read locks everywhere - because they
+don't block on each other (and thus there is no dead-lock wrt interrupts.
+But when you do the write-lock, you have to use the irq-safe version.
+
+For an example of being clever with rw-locks, see the "waitqueue_lock"
+handling in kernel/sched/core.c - nothing ever _changes_ a wait-queue from
+within an interrupt, they only read the queue in order to know whom to
+wake up. So read-locks are safe (which is good: they are very common
+indeed), while write-locks need to protect themselves against interrupts.
+
+                Linus
+
+----
+
+Reference information:
+======================
+
+For dynamic initialization, use spin_lock_init() or rwlock_init() as
+appropriate::
+
+   spinlock_t xxx_lock;
+   rwlock_t xxx_rw_lock;
+
+   static int __init xxx_init(void)
+   {
+        spin_lock_init(&xxx_lock);
+        rwlock_init(&xxx_rw_lock);
+        ...
+   }
+
+   module_init(xxx_init);
+
+For static initialization, use DEFINE_SPINLOCK() / DEFINE_RWLOCK() or
+__SPIN_LOCK_UNLOCKED() / __RW_LOCK_UNLOCKED() as appropriate.
diff --git a/Documentation/locking/spinlocks.txt b/Documentation/locking/spinlocks.txt
deleted file mode 100644
index ff35e40bdf5b..000000000000
--- a/Documentation/locking/spinlocks.txt
+++ /dev/null
@@ -1,167 +0,0 @@
-Lesson 1: Spin locks
-
-The most basic primitive for locking is spinlock.
-
-static DEFINE_SPINLOCK(xxx_lock);
-
-        unsigned long flags;
-
-        spin_lock_irqsave(&xxx_lock, flags);
-        ... critical section here ..
- spin_unlock_irqrestore(&xxx_lock, flags); - -The above is always safe. It will disable interrupts _locally_, but the -spinlock itself will guarantee the global lock, so it will guarantee that -there is only one thread-of-control within the region(s) protected by that -lock. This works well even under UP also, so the code does _not_ need to -worry about UP vs SMP issues: the spinlocks work correctly under both. - - NOTE! Implications of spin_locks for memory are further described in: - - Documentation/memory-barriers.txt - (5) LOCK operations. - (6) UNLOCK operations. - -The above is usually pretty simple (you usually need and want only one -spinlock for most things - using more than one spinlock can make things a -lot more complex and even slower and is usually worth it only for -sequences that you _know_ need to be split up: avoid it at all cost if you -aren't sure). - -This is really the only really hard part about spinlocks: once you start -using spinlocks they tend to expand to areas you might not have noticed -before, because you have to make sure the spinlocks correctly protect the -shared data structures _everywhere_ they are used. The spinlocks are most -easily added to places that are completely independent of other code (for -example, internal driver data structures that nobody else ever touches). - - NOTE! The spin-lock is safe only when you _also_ use the lock itself - to do locking across CPU's, which implies that EVERYTHING that - touches a shared variable has to agree about the spinlock they want - to use. - ----- - -Lesson 2: reader-writer spinlocks. - -If your data accesses have a very natural pattern where you usually tend -to mostly read from the shared variables, the reader-writer locks -(rw_lock) versions of the spinlocks are sometimes useful. They allow multiple -readers to be in the same critical region at once, but if somebody wants -to change the variables it has to get an exclusive write lock. - - NOTE! 
reader-writer locks require more atomic memory operations than - simple spinlocks. Unless the reader critical section is long, you - are better off just using spinlocks. - -The routines look the same as above: - - rwlock_t xxx_lock = __RW_LOCK_UNLOCKED(xxx_lock); - - unsigned long flags; - - read_lock_irqsave(&xxx_lock, flags); - .. critical section that only reads the info ... - read_unlock_irqrestore(&xxx_lock, flags); - - write_lock_irqsave(&xxx_lock, flags); - .. read and write exclusive access to the info ... - write_unlock_irqrestore(&xxx_lock, flags); - -The above kind of lock may be useful for complex data structures like -linked lists, especially searching for entries without changing the list -itself. The read lock allows many concurrent readers. Anything that -_changes_ the list will have to get the write lock. - - NOTE! RCU is better for list traversal, but requires careful - attention to design detail (see Documentation/RCU/listRCU.txt). - -Also, you cannot "upgrade" a read-lock to a write-lock, so if you at _any_ -time need to do any changes (even if you don't do it every time), you have -to get the write-lock at the very beginning. - - NOTE! We are working hard to remove reader-writer spinlocks in most - cases, so please don't add a new one without consensus. (Instead, see - Documentation/RCU/rcu.txt for complete information.) - ----- - -Lesson 3: spinlocks revisited. - -The single spin-lock primitives above are by no means the only ones. They -are the most safe ones, and the ones that work under all circumstances, -but partly _because_ they are safe they are also fairly slow. They are slower -than they'd need to be, because they do have to disable interrupts -(which is just a single instruction on a x86, but it's an expensive one - -and on other architectures it can be worse). 
- -If you have a case where you have to protect a data structure across -several CPU's and you want to use spinlocks you can potentially use -cheaper versions of the spinlocks. IFF you know that the spinlocks are -never used in interrupt handlers, you can use the non-irq versions: - - spin_lock(&lock); - ... - spin_unlock(&lock); - -(and the equivalent read-write versions too, of course). The spinlock will -guarantee the same kind of exclusive access, and it will be much faster. -This is useful if you know that the data in question is only ever -manipulated from a "process context", ie no interrupts involved. - -The reasons you mustn't use these versions if you have interrupts that -play with the spinlock is that you can get deadlocks: - - spin_lock(&lock); - ... - <- interrupt comes in: - spin_lock(&lock); - -where an interrupt tries to lock an already locked variable. This is ok if -the other interrupt happens on another CPU, but it is _not_ ok if the -interrupt happens on the same CPU that already holds the lock, because the -lock will obviously never be released (because the interrupt is waiting -for the lock, and the lock-holder is interrupted by the interrupt and will -not continue until the interrupt has been processed). - -(This is also the reason why the irq-versions of the spinlocks only need -to disable the _local_ interrupts - it's ok to use spinlocks in interrupts -on other CPU's, because an interrupt on another CPU doesn't interrupt the -CPU that holds the lock, so the lock-holder can continue and eventually -releases the lock). - -Note that you can be clever with read-write locks and interrupts. For -example, if you know that the interrupt only ever gets a read-lock, then -you can use a non-irq version of read locks everywhere - because they -don't block on each other (and thus there is no dead-lock wrt interrupts. -But when you do the write-lock, you have to use the irq-safe version. 
- -For an example of being clever with rw-locks, see the "waitqueue_lock" -handling in kernel/sched/core.c - nothing ever _changes_ a wait-queue from -within an interrupt, they only read the queue in order to know whom to -wake up. So read-locks are safe (which is good: they are very common -indeed), while write-locks need to protect themselves against interrupts. - - Linus - ----- - -Reference information: - -For dynamic initialization, use spin_lock_init() or rwlock_init() as -appropriate: - - spinlock_t xxx_lock; - rwlock_t xxx_rw_lock; - - static int __init xxx_init(void) - { - spin_lock_init(&xxx_lock); - rwlock_init(&xxx_rw_lock); - ... - } - - module_init(xxx_init); - -For static initialization, use DEFINE_SPINLOCK() / DEFINE_RWLOCK() or -__SPIN_LOCK_UNLOCKED() / __RW_LOCK_UNLOCKED() as appropriate. diff --git a/Documentation/locking/ww-mutex-design.rst b/Documentation/locking/ww-mutex-design.rst new file mode 100644 index 000000000000..1846c199da23 --- /dev/null +++ b/Documentation/locking/ww-mutex-design.rst @@ -0,0 +1,393 @@ +====================================== +Wound/Wait Deadlock-Proof Mutex Design +====================================== + +Please read mutex-design.txt first, as it applies to wait/wound mutexes too. + +Motivation for WW-Mutexes +------------------------- + +GPU's do operations that commonly involve many buffers. Those buffers +can be shared across contexts/processes, exist in different memory +domains (for example VRAM vs system memory), and so on. And with +PRIME / dmabuf, they can even be shared across devices. So there are +a handful of situations where the driver needs to wait for buffers to +become ready. If you think about this in terms of waiting on a buffer +mutex for it to become available, this presents a problem because +there is no way to guarantee that buffers appear in a execbuf/batch in +the same order in all contexts. 
That is directly under control of +userspace, and a result of the sequence of GL calls that an application +makes. Which results in the potential for deadlock. The problem gets +more complex when you consider that the kernel may need to migrate the +buffer(s) into VRAM before the GPU operates on the buffer(s), which +may in turn require evicting some other buffers (and you don't want to +evict other buffers which are already queued up to the GPU), but for a +simplified understanding of the problem you can ignore this. + +The algorithm that the TTM graphics subsystem came up with for dealing with +this problem is quite simple. For each group of buffers (execbuf) that need +to be locked, the caller would be assigned a unique reservation id/ticket, +from a global counter. In case of deadlock while locking all the buffers +associated with a execbuf, the one with the lowest reservation ticket (i.e. +the oldest task) wins, and the one with the higher reservation id (i.e. the +younger task) unlocks all of the buffers that it has already locked, and then +tries again. + +In the RDBMS literature, a reservation ticket is associated with a transaction. +and the deadlock handling approach is called Wait-Die. The name is based on +the actions of a locking thread when it encounters an already locked mutex. +If the transaction holding the lock is younger, the locking transaction waits. +If the transaction holding the lock is older, the locking transaction backs off +and dies. Hence Wait-Die. +There is also another algorithm called Wound-Wait: +If the transaction holding the lock is younger, the locking transaction +wounds the transaction holding the lock, requesting it to die. +If the transaction holding the lock is older, it waits for the other +transaction. Hence Wound-Wait. +The two algorithms are both fair in that a transaction will eventually succeed. 
+However, the Wound-Wait algorithm is typically stated to generate fewer backoffs +compared to Wait-Die, but is, on the other hand, associated with more work than +Wait-Die when recovering from a backoff. Wound-Wait is also a preemptive +algorithm in that transactions are wounded by other transactions, and that +requires a reliable way to pick up up the wounded condition and preempt the +running transaction. Note that this is not the same as process preemption. A +Wound-Wait transaction is considered preempted when it dies (returning +-EDEADLK) following a wound. + +Concepts +-------- + +Compared to normal mutexes two additional concepts/objects show up in the lock +interface for w/w mutexes: + +Acquire context: To ensure eventual forward progress it is important the a task +trying to acquire locks doesn't grab a new reservation id, but keeps the one it +acquired when starting the lock acquisition. This ticket is stored in the +acquire context. Furthermore the acquire context keeps track of debugging state +to catch w/w mutex interface abuse. An acquire context is representing a +transaction. + +W/w class: In contrast to normal mutexes the lock class needs to be explicit for +w/w mutexes, since it is required to initialize the acquire context. The lock +class also specifies what algorithm to use, Wound-Wait or Wait-Die. + +Furthermore there are three different class of w/w lock acquire functions: + +* Normal lock acquisition with a context, using ww_mutex_lock. + +* Slowpath lock acquisition on the contending lock, used by the task that just + killed its transaction after having dropped all already acquired locks. + These functions have the _slow postfix. + + From a simple semantics point-of-view the _slow functions are not strictly + required, since simply calling the normal ww_mutex_lock functions on the + contending lock (after having dropped all other already acquired locks) will + work correctly. 
After all if no other ww mutex has been acquired yet there's + no deadlock potential and hence the ww_mutex_lock call will block and not + prematurely return -EDEADLK. The advantage of the _slow functions is in + interface safety: + + - ww_mutex_lock has a __must_check int return type, whereas ww_mutex_lock_slow + has a void return type. Note that since ww mutex code needs loops/retries + anyway the __must_check doesn't result in spurious warnings, even though the + very first lock operation can never fail. + - When full debugging is enabled ww_mutex_lock_slow checks that all acquired + ww mutex have been released (preventing deadlocks) and makes sure that we + block on the contending lock (preventing spinning through the -EDEADLK + slowpath until the contended lock can be acquired). + +* Functions to only acquire a single w/w mutex, which results in the exact same + semantics as a normal mutex. This is done by calling ww_mutex_lock with a NULL + context. + + Again this is not strictly required. But often you only want to acquire a + single lock in which case it's pointless to set up an acquire context (and so + better to avoid grabbing a deadlock avoidance ticket). + +Of course, all the usual variants for handling wake-ups due to signals are also +provided. + +Usage +----- + +The algorithm (Wait-Die vs Wound-Wait) is chosen by using either +DEFINE_WW_CLASS() (Wound-Wait) or DEFINE_WD_CLASS() (Wait-Die) +As a rough rule of thumb, use Wound-Wait iff you +expect the number of simultaneous competing transactions to be typically small, +and you want to reduce the number of rollbacks. + +Three different ways to acquire locks within the same w/w class. Common +definitions for methods #1 and #2:: + + static DEFINE_WW_CLASS(ww_class); + + struct obj { + struct ww_mutex lock; + /* obj data */ + }; + + struct obj_entry { + struct list_head head; + struct obj *obj; + }; + +Method 1, using a list in execbuf->buffers that's not allowed to be reordered. 
+This is useful if a list of required objects is already tracked somewhere. +Furthermore the lock helper can use propagate the -EALREADY return code back to +the caller as a signal that an object is twice on the list. This is useful if +the list is constructed from userspace input and the ABI requires userspace to +not have duplicate entries (e.g. for a gpu commandbuffer submission ioctl):: + + int lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx) + { + struct obj *res_obj = NULL; + struct obj_entry *contended_entry = NULL; + struct obj_entry *entry; + + ww_acquire_init(ctx, &ww_class); + + retry: + list_for_each_entry (entry, list, head) { + if (entry->obj == res_obj) { + res_obj = NULL; + continue; + } + ret = ww_mutex_lock(&entry->obj->lock, ctx); + if (ret < 0) { + contended_entry = entry; + goto err; + } + } + + ww_acquire_done(ctx); + return 0; + + err: + list_for_each_entry_continue_reverse (entry, list, head) + ww_mutex_unlock(&entry->obj->lock); + + if (res_obj) + ww_mutex_unlock(&res_obj->lock); + + if (ret == -EDEADLK) { + /* we lost out in a seqno race, lock and retry.. */ + ww_mutex_lock_slow(&contended_entry->obj->lock, ctx); + res_obj = contended_entry->obj; + goto retry; + } + ww_acquire_fini(ctx); + + return ret; + } + +Method 2, using a list in execbuf->buffers that can be reordered. Same semantics +of duplicate entry detection using -EALREADY as method 1 above. But the +list-reordering allows for a bit more idiomatic code:: + + int lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx) + { + struct obj_entry *entry, *entry2; + + ww_acquire_init(ctx, &ww_class); + + list_for_each_entry (entry, list, head) { + ret = ww_mutex_lock(&entry->obj->lock, ctx); + if (ret < 0) { + entry2 = entry; + + list_for_each_entry_continue_reverse (entry2, list, head) + ww_mutex_unlock(&entry2->obj->lock); + + if (ret != -EDEADLK) { + ww_acquire_fini(ctx); + return ret; + } + + /* we lost out in a seqno race, lock and retry.. 
*/ + ww_mutex_lock_slow(&entry->obj->lock, ctx); + + /* + * Move buf to head of the list, this will point + * buf->next to the first unlocked entry, + * restarting the for loop. + */ + list_del(&entry->head); + list_add(&entry->head, list); + } + } + + ww_acquire_done(ctx); + return 0; + } + +Unlocking works the same way for both methods #1 and #2:: + + void unlock_objs(struct list_head *list, struct ww_acquire_ctx *ctx) + { + struct obj_entry *entry; + + list_for_each_entry (entry, list, head) + ww_mutex_unlock(&entry->obj->lock); + + ww_acquire_fini(ctx); + } + +Method 3 is useful if the list of objects is constructed ad-hoc and not upfront, +e.g. when adjusting edges in a graph where each node has its own ww_mutex lock, +and edges can only be changed when holding the locks of all involved nodes. w/w +mutexes are a natural fit for such a case for two reasons: + +- They can handle lock-acquisition in any order which allows us to start walking + a graph from a starting point and then iteratively discovering new edges and + locking down the nodes those edges connect to. +- Due to the -EALREADY return code signalling that a given objects is already + held there's no need for additional book-keeping to break cycles in the graph + or keep track off which looks are already held (when using more than one node + as a starting point). + +Note that this approach differs in two important ways from the above methods: + +- Since the list of objects is dynamically constructed (and might very well be + different when retrying due to hitting the -EDEADLK die condition) there's + no need to keep any object on a persistent list when it's not locked. We can + therefore move the list_head into the object itself. +- On the other hand the dynamic object list construction also means that the -EALREADY return + code can't be propagated. + +Note also that methods #1 and #2 and method #3 can be combined, e.g. 
to first lock a +list of starting nodes (passed in from userspace) using one of the above +methods. And then lock any additional objects affected by the operations using +method #3 below. The backoff/retry procedure will be a bit more involved, since +when the dynamic locking step hits -EDEADLK we also need to unlock all the +objects acquired with the fixed list. But the w/w mutex debug checks will catch +any interface misuse for these cases. + +Also, method 3 can't fail the lock acquisition step since it doesn't return +-EALREADY. Of course this would be different when using the _interruptible +variants, but that's outside of the scope of these examples here:: + + struct obj { + struct ww_mutex ww_mutex; + struct list_head locked_list; + }; + + static DEFINE_WW_CLASS(ww_class); + + void __unlock_objs(struct list_head *list) + { + struct obj *entry, *temp; + + list_for_each_entry_safe (entry, temp, list, locked_list) { + /* need to do that before unlocking, since only the current lock holder is + allowed to use object */ + list_del(&entry->locked_list); + ww_mutex_unlock(entry->ww_mutex) + } + } + + void lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx) + { + struct obj *obj; + + ww_acquire_init(ctx, &ww_class); + + retry: + /* re-init loop start state */ + loop { + /* magic code which walks over a graph and decides which objects + * to lock */ + + ret = ww_mutex_lock(obj->ww_mutex, ctx); + if (ret == -EALREADY) { + /* we have that one already, get to the next object */ + continue; + } + if (ret == -EDEADLK) { + __unlock_objs(list); + + ww_mutex_lock_slow(obj, ctx); + list_add(&entry->locked_list, list); + goto retry; + } + + /* locked a new object, add it to the list */ + list_add_tail(&entry->locked_list, list); + } + + ww_acquire_done(ctx); + return 0; + } + + void unlock_objs(struct list_head *list, struct ww_acquire_ctx *ctx) + { + __unlock_objs(list); + ww_acquire_fini(ctx); + } + +Method 4: Only lock one single objects. 
In that case deadlock detection and +prevention is obviously overkill, since with grabbing just one lock you can't +produce a deadlock within just one class. To simplify this case the w/w mutex +api can be used with a NULL context. + +Implementation Details +---------------------- + +Design: +^^^^^^^ + + ww_mutex currently encapsulates a struct mutex, this means no extra overhead for + normal mutex locks, which are far more common. As such there is only a small + increase in code size if wait/wound mutexes are not used. + + We maintain the following invariants for the wait list: + + (1) Waiters with an acquire context are sorted by stamp order; waiters + without an acquire context are interspersed in FIFO order. + (2) For Wait-Die, among waiters with contexts, only the first one can have + other locks acquired already (ctx->acquired > 0). Note that this waiter + may come after other waiters without contexts in the list. + + The Wound-Wait preemption is implemented with a lazy-preemption scheme: + The wounded status of the transaction is checked only when there is + contention for a new lock and hence a true chance of deadlock. In that + situation, if the transaction is wounded, it backs off, clears the + wounded status and retries. A great benefit of implementing preemption in + this way is that the wounded transaction can identify a contending lock to + wait for before restarting the transaction. Just blindly restarting the + transaction would likely make the transaction end up in a situation where + it would have to back off again. + + In general, not much contention is expected. The locks are typically used to + serialize access to resources for devices, and optimization focus should + therefore be directed towards the uncontended cases. + +Lockdep: +^^^^^^^^ + + Special care has been taken to warn for as many cases of api abuse + as possible. Some common api abuses will be caught with + CONFIG_DEBUG_MUTEXES, but CONFIG_PROVE_LOCKING is recommended. 
+ + Some of the errors which will be warned about: + - Forgetting to call ww_acquire_fini or ww_acquire_init. + - Attempting to lock more mutexes after ww_acquire_done. + - Attempting to lock the wrong mutex after -EDEADLK and + unlocking all mutexes. + - Attempting to lock the right mutex after -EDEADLK, + before unlocking all mutexes. + + - Calling ww_mutex_lock_slow before -EDEADLK was returned. + + - Unlocking mutexes with the wrong unlock function. + - Calling one of the ww_acquire_* twice on the same context. + - Using a different ww_class for the mutex than for the ww_acquire_ctx. + - Normal lockdep errors that can result in deadlocks. + + Some of the lockdep errors that can result in deadlocks: + - Calling ww_acquire_init to initialize a second ww_acquire_ctx before + having called ww_acquire_fini on the first. + - 'normal' deadlocks that can occur. + +FIXME: + Update this section once we have the TASK_DEADLOCK task state flag magic + implemented. diff --git a/Documentation/locking/ww-mutex-design.txt b/Documentation/locking/ww-mutex-design.txt deleted file mode 100644 index f0ed7c30e695..000000000000 --- a/Documentation/locking/ww-mutex-design.txt +++ /dev/null @@ -1,383 +0,0 @@ -Wound/Wait Deadlock-Proof Mutex Design -====================================== - -Please read mutex-design.txt first, as it applies to wait/wound mutexes too. - -Motivation for WW-Mutexes -------------------------- - -GPU's do operations that commonly involve many buffers. Those buffers -can be shared across contexts/processes, exist in different memory -domains (for example VRAM vs system memory), and so on. And with -PRIME / dmabuf, they can even be shared across devices. So there are -a handful of situations where the driver needs to wait for buffers to -become ready. 
If you think about this in terms of waiting on a buffer -mutex for it to become available, this presents a problem because -there is no way to guarantee that buffers appear in a execbuf/batch in -the same order in all contexts. That is directly under control of -userspace, and a result of the sequence of GL calls that an application -makes. Which results in the potential for deadlock. The problem gets -more complex when you consider that the kernel may need to migrate the -buffer(s) into VRAM before the GPU operates on the buffer(s), which -may in turn require evicting some other buffers (and you don't want to -evict other buffers which are already queued up to the GPU), but for a -simplified understanding of the problem you can ignore this. - -The algorithm that the TTM graphics subsystem came up with for dealing with -this problem is quite simple. For each group of buffers (execbuf) that need -to be locked, the caller would be assigned a unique reservation id/ticket, -from a global counter. In case of deadlock while locking all the buffers -associated with a execbuf, the one with the lowest reservation ticket (i.e. -the oldest task) wins, and the one with the higher reservation id (i.e. the -younger task) unlocks all of the buffers that it has already locked, and then -tries again. - -In the RDBMS literature, a reservation ticket is associated with a transaction. -and the deadlock handling approach is called Wait-Die. The name is based on -the actions of a locking thread when it encounters an already locked mutex. -If the transaction holding the lock is younger, the locking transaction waits. -If the transaction holding the lock is older, the locking transaction backs off -and dies. Hence Wait-Die. -There is also another algorithm called Wound-Wait: -If the transaction holding the lock is younger, the locking transaction -wounds the transaction holding the lock, requesting it to die. 
-If the transaction holding the lock is older, it waits for the other -transaction. Hence Wound-Wait. -The two algorithms are both fair in that a transaction will eventually succeed. -However, the Wound-Wait algorithm is typically stated to generate fewer backoffs -compared to Wait-Die, but is, on the other hand, associated with more work than -Wait-Die when recovering from a backoff. Wound-Wait is also a preemptive -algorithm in that transactions are wounded by other transactions, and that -requires a reliable way to pick up up the wounded condition and preempt the -running transaction. Note that this is not the same as process preemption. A -Wound-Wait transaction is considered preempted when it dies (returning --EDEADLK) following a wound. - -Concepts --------- - -Compared to normal mutexes two additional concepts/objects show up in the lock -interface for w/w mutexes: - -Acquire context: To ensure eventual forward progress it is important the a task -trying to acquire locks doesn't grab a new reservation id, but keeps the one it -acquired when starting the lock acquisition. This ticket is stored in the -acquire context. Furthermore the acquire context keeps track of debugging state -to catch w/w mutex interface abuse. An acquire context is representing a -transaction. - -W/w class: In contrast to normal mutexes the lock class needs to be explicit for -w/w mutexes, since it is required to initialize the acquire context. The lock -class also specifies what algorithm to use, Wound-Wait or Wait-Die. - -Furthermore there are three different class of w/w lock acquire functions: - -* Normal lock acquisition with a context, using ww_mutex_lock. - -* Slowpath lock acquisition on the contending lock, used by the task that just - killed its transaction after having dropped all already acquired locks. - These functions have the _slow postfix. 
-
-  From a simple semantics point-of-view the _slow functions are not strictly
-  required, since simply calling the normal ww_mutex_lock functions on the
-  contending lock (after having dropped all other already acquired locks) will
-  work correctly. After all if no other ww mutex has been acquired yet there's
-  no deadlock potential and hence the ww_mutex_lock call will block and not
-  prematurely return -EDEADLK. The advantage of the _slow functions is in
-  interface safety:
-  - ww_mutex_lock has a __must_check int return type, whereas ww_mutex_lock_slow
-    has a void return type. Note that since ww mutex code needs loops/retries
-    anyway the __must_check doesn't result in spurious warnings, even though the
-    very first lock operation can never fail.
-  - When full debugging is enabled ww_mutex_lock_slow checks that all acquired
-    ww mutex have been released (preventing deadlocks) and makes sure that we
-    block on the contending lock (preventing spinning through the -EDEADLK
-    slowpath until the contended lock can be acquired).
-
-* Functions to only acquire a single w/w mutex, which results in the exact same
-  semantics as a normal mutex. This is done by calling ww_mutex_lock with a NULL
-  context.
-
-  Again this is not strictly required. But often you only want to acquire a
-  single lock in which case it's pointless to set up an acquire context (and so
-  better to avoid grabbing a deadlock avoidance ticket).
-
-Of course, all the usual variants for handling wake-ups due to signals are also
-provided.
-
-Usage
------
-
-The algorithm (Wait-Die vs Wound-Wait) is chosen by using either
-DEFINE_WW_CLASS() (Wound-Wait) or DEFINE_WD_CLASS() (Wait-Die)
-As a rough rule of thumb, use Wound-Wait iff you
-expect the number of simultaneous competing transactions to be typically small,
-and you want to reduce the number of rollbacks.
-
-Three different ways to acquire locks within the same w/w class.
Common
-definitions for methods #1 and #2:
-
-static DEFINE_WW_CLASS(ww_class);
-
-struct obj {
-	struct ww_mutex lock;
-	/* obj data */
-};
-
-struct obj_entry {
-	struct list_head head;
-	struct obj *obj;
-};
-
-Method 1, using a list in execbuf->buffers that's not allowed to be reordered.
-This is useful if a list of required objects is already tracked somewhere.
-Furthermore the lock helper can use propagate the -EALREADY return code back to
-the caller as a signal that an object is twice on the list. This is useful if
-the list is constructed from userspace input and the ABI requires userspace to
-not have duplicate entries (e.g. for a gpu commandbuffer submission ioctl).
-
-int lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
-{
-	struct obj *res_obj = NULL;
-	struct obj_entry *contended_entry = NULL;
-	struct obj_entry *entry;
-
-	ww_acquire_init(ctx, &ww_class);
-
-retry:
-	list_for_each_entry (entry, list, head) {
-		if (entry->obj == res_obj) {
-			res_obj = NULL;
-			continue;
-		}
-		ret = ww_mutex_lock(&entry->obj->lock, ctx);
-		if (ret < 0) {
-			contended_entry = entry;
-			goto err;
-		}
-	}
-
-	ww_acquire_done(ctx);
-	return 0;
-
-err:
-	list_for_each_entry_continue_reverse (entry, list, head)
-		ww_mutex_unlock(&entry->obj->lock);
-
-	if (res_obj)
-		ww_mutex_unlock(&res_obj->lock);
-
-	if (ret == -EDEADLK) {
-		/* we lost out in a seqno race, lock and retry.. */
-		ww_mutex_lock_slow(&contended_entry->obj->lock, ctx);
-		res_obj = contended_entry->obj;
-		goto retry;
-	}
-	ww_acquire_fini(ctx);
-
-	return ret;
-}
-
-Method 2, using a list in execbuf->buffers that can be reordered. Same semantics
-of duplicate entry detection using -EALREADY as method 1 above. But the
-list-reordering allows for a bit more idiomatic code.
-
-int lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
-{
-	struct obj_entry *entry, *entry2;
-
-	ww_acquire_init(ctx, &ww_class);
-
-	list_for_each_entry (entry, list, head) {
-		ret = ww_mutex_lock(&entry->obj->lock, ctx);
-		if (ret < 0) {
-			entry2 = entry;
-
-			list_for_each_entry_continue_reverse (entry2, list, head)
-				ww_mutex_unlock(&entry2->obj->lock);
-
-			if (ret != -EDEADLK) {
-				ww_acquire_fini(ctx);
-				return ret;
-			}
-
-			/* we lost out in a seqno race, lock and retry.. */
-			ww_mutex_lock_slow(&entry->obj->lock, ctx);
-
-			/*
-			 * Move buf to head of the list, this will point
-			 * buf->next to the first unlocked entry,
-			 * restarting the for loop.
-			 */
-			list_del(&entry->head);
-			list_add(&entry->head, list);
-		}
-	}
-
-	ww_acquire_done(ctx);
-	return 0;
-}
-
-Unlocking works the same way for both methods #1 and #2:
-
-void unlock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
-{
-	struct obj_entry *entry;
-
-	list_for_each_entry (entry, list, head)
-		ww_mutex_unlock(&entry->obj->lock);
-
-	ww_acquire_fini(ctx);
-}
-
-Method 3 is useful if the list of objects is constructed ad-hoc and not upfront,
-e.g. when adjusting edges in a graph where each node has its own ww_mutex lock,
-and edges can only be changed when holding the locks of all involved nodes. w/w
-mutexes are a natural fit for such a case for two reasons:
-- They can handle lock-acquisition in any order which allows us to start walking
-  a graph from a starting point and then iteratively discovering new edges and
-  locking down the nodes those edges connect to.
-- Due to the -EALREADY return code signalling that a given objects is already
-  held there's no need for additional book-keeping to break cycles in the graph
-  or keep track off which looks are already held (when using more than one node
-  as a starting point).
-
-Note that this approach differs in two important ways from the above methods:
-- Since the list of objects is dynamically constructed (and might very well be
-  different when retrying due to hitting the -EDEADLK die condition) there's
-  no need to keep any object on a persistent list when it's not locked. We can
-  therefore move the list_head into the object itself.
-- On the other hand the dynamic object list construction also means that the -EALREADY return
-  code can't be propagated.
-
-Note also that methods #1 and #2 and method #3 can be combined, e.g. to first lock a
-list of starting nodes (passed in from userspace) using one of the above
-methods. And then lock any additional objects affected by the operations using
-method #3 below. The backoff/retry procedure will be a bit more involved, since
-when the dynamic locking step hits -EDEADLK we also need to unlock all the
-objects acquired with the fixed list. But the w/w mutex debug checks will catch
-any interface misuse for these cases.
-
-Also, method 3 can't fail the lock acquisition step since it doesn't return
--EALREADY. Of course this would be different when using the _interruptible
-variants, but that's outside of the scope of these examples here.
-
-struct obj {
-	struct ww_mutex ww_mutex;
-	struct list_head locked_list;
-};
-
-static DEFINE_WW_CLASS(ww_class);
-
-void __unlock_objs(struct list_head *list)
-{
-	struct obj *entry, *temp;
-
-	list_for_each_entry_safe (entry, temp, list, locked_list) {
-		/* need to do that before unlocking, since only the current lock holder is
-		   allowed to use object */
-		list_del(&entry->locked_list);
-		ww_mutex_unlock(entry->ww_mutex)
-	}
-}
-
-void lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
-{
-	struct obj *obj;
-
-	ww_acquire_init(ctx, &ww_class);
-
-retry:
-	/* re-init loop start state */
-	loop {
-		/* magic code which walks over a graph and decides which objects
-		 * to lock */
-
-		ret = ww_mutex_lock(obj->ww_mutex, ctx);
-		if (ret == -EALREADY) {
-			/* we have that one already, get to the next object */
-			continue;
-		}
-		if (ret == -EDEADLK) {
-			__unlock_objs(list);
-
-			ww_mutex_lock_slow(obj, ctx);
-			list_add(&entry->locked_list, list);
-			goto retry;
-		}
-
-		/* locked a new object, add it to the list */
-		list_add_tail(&entry->locked_list, list);
-	}
-
-	ww_acquire_done(ctx);
-	return 0;
-}
-
-void unlock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
-{
-	__unlock_objs(list);
-	ww_acquire_fini(ctx);
-}
-
-Method 4: Only lock one single objects. In that case deadlock detection and
-prevention is obviously overkill, since with grabbing just one lock you can't
-produce a deadlock within just one class. To simplify this case the w/w mutex
-api can be used with a NULL context.
-
-Implementation Details
-----------------------
-
-Design:
-  ww_mutex currently encapsulates a struct mutex, this means no extra overhead for
-  normal mutex locks, which are far more common. As such there is only a small
-  increase in code size if wait/wound mutexes are not used.
-
-  We maintain the following invariants for the wait list:
-  (1) Waiters with an acquire context are sorted by stamp order; waiters
-      without an acquire context are interspersed in FIFO order.
-  (2) For Wait-Die, among waiters with contexts, only the first one can have
-      other locks acquired already (ctx->acquired > 0). Note that this waiter
-      may come after other waiters without contexts in the list.
-
-  The Wound-Wait preemption is implemented with a lazy-preemption scheme:
-  The wounded status of the transaction is checked only when there is
-  contention for a new lock and hence a true chance of deadlock. In that
-  situation, if the transaction is wounded, it backs off, clears the
-  wounded status and retries. A great benefit of implementing preemption in
-  this way is that the wounded transaction can identify a contending lock to
-  wait for before restarting the transaction. Just blindly restarting the
-  transaction would likely make the transaction end up in a situation where
-  it would have to back off again.
-
-  In general, not much contention is expected. The locks are typically used to
-  serialize access to resources for devices, and optimization focus should
-  therefore be directed towards the uncontended cases.
-
-Lockdep:
-  Special care has been taken to warn for as many cases of api abuse
-  as possible. Some common api abuses will be caught with
-  CONFIG_DEBUG_MUTEXES, but CONFIG_PROVE_LOCKING is recommended.
-
-  Some of the errors which will be warned about:
-   - Forgetting to call ww_acquire_fini or ww_acquire_init.
-   - Attempting to lock more mutexes after ww_acquire_done.
-   - Attempting to lock the wrong mutex after -EDEADLK and
-     unlocking all mutexes.
-   - Attempting to lock the right mutex after -EDEADLK,
-     before unlocking all mutexes.
-
-   - Calling ww_mutex_lock_slow before -EDEADLK was returned.
-
-   - Unlocking mutexes with the wrong unlock function.
-   - Calling one of the ww_acquire_* twice on the same context.
-   - Using a different ww_class for the mutex than for the ww_acquire_ctx.
-   - Normal lockdep errors that can result in deadlocks.
-
-  Some of the lockdep errors that can result in deadlocks:
-   - Calling ww_acquire_init to initialize a second ww_acquire_ctx before
-     having called ww_acquire_fini on the first.
-   - 'normal' deadlocks that can occur.
-
-FIXME: Update this section once we have the TASK_DEADLOCK task state flag magic
-implemented.
diff --git a/Documentation/pi-futex.txt b/Documentation/pi-futex.txt
index b154f6c0c36e..c33ba2befbf8 100644
--- a/Documentation/pi-futex.txt
+++ b/Documentation/pi-futex.txt
@@ -119,4 +119,4 @@ properties of futexes, and all four combinations are possible: futex,
 robust-futex, PI-futex, robust+PI-futex.
 
 More details about priority inheritance can be found in
-Documentation/locking/rt-mutex.txt.
+Documentation/locking/rt-mutex.rst.
diff --git a/Documentation/translations/it_IT/kernel-hacking/locking.rst b/Documentation/translations/it_IT/kernel-hacking/locking.rst
index 5fd8a1abd2be..b9a6be4b8499 100644
--- a/Documentation/translations/it_IT/kernel-hacking/locking.rst
+++ b/Documentation/translations/it_IT/kernel-hacking/locking.rst
@@ -1404,7 +1404,7 @@ Riferimento per l'API dei Futex
 Approfondimenti
 ===============
 
-- ``Documentation/locking/spinlocks.txt``: la guida di Linus Torvalds agli
+- ``Documentation/locking/spinlocks.rst``: la guida di Linus Torvalds agli
   spinlock del kernel.
 
 - Unix Systems for Modern Architectures: Symmetric Multiprocessing and
--
cgit v1.2.3