From 326bc876fed4fad2f46dd1be637e90e1b525ee8c Mon Sep 17 00:00:00 2001 From: Silvio Fricke Date: Mon, 28 Nov 2016 18:30:55 +0100 Subject: Documentation/atomic_ops.txt: convert to ReST markup ... and move to core-api folder. Signed-off-by: Silvio Fricke Signed-off-by: Jonathan Corbet --- Documentation/atomic_ops.txt | 640 -------------------- Documentation/core-api/atomic_ops.rst | 658 +++++++++++++++++++++ Documentation/core-api/index.rst | 1 + .../process/volatile-considered-harmful.rst | 3 + 4 files changed, 662 insertions(+), 640 deletions(-) delete mode 100644 Documentation/atomic_ops.txt create mode 100644 Documentation/core-api/atomic_ops.rst diff --git a/Documentation/atomic_ops.txt b/Documentation/atomic_ops.txt deleted file mode 100644 index 6c5e8a9d2c6e..000000000000 --- a/Documentation/atomic_ops.txt +++ /dev/null @@ -1,640 +0,0 @@ - Semantics and Behavior of Atomic and - Bitmask Operations - - David S. Miller - - This document is intended to serve as a guide to Linux port -maintainers on how to implement atomic counter, bitops, and spinlock -interfaces properly. - - The atomic_t type should be defined as a signed integer and -the atomic_long_t type as a signed long integer. Also, they should -be made opaque such that any kind of cast to a normal C integer type -will fail. Something like the following should suffice: - - typedef struct { int counter; } atomic_t; - typedef struct { long counter; } atomic_long_t; - -Historically, counter has been declared volatile. This is now discouraged. -See Documentation/process/volatile-considered-harmful.rst for the complete rationale. - -local_t is very similar to atomic_t. If the counter is per CPU and only -updated by one CPU, local_t is probably more appropriate. Please see -Documentation/local_ops.txt for the semantics of local_t. - -The first operations to implement for atomic_t's are the initializers and -plain reads. - - #define ATOMIC_INIT(i) { (i) } - #define atomic_set(v, i) ((v)->counter = (i)) - -The first macro is used in definitions, such as: - -static atomic_t my_counter = ATOMIC_INIT(1); - -The initializer is atomic in that the return values of the atomic operations -are guaranteed to be correct reflecting the initialized value if the -initializer is used before runtime. If the initializer is used at runtime, a -proper implicit or explicit read memory barrier is needed before reading the -value with atomic_read from another thread. - -As with all of the atomic_ interfaces, replace the leading "atomic_" -with "atomic_long_" to operate on atomic_long_t. - -The second interface can be used at runtime, as in: - - struct foo { atomic_t counter; }; - ... - - struct foo *k; - - k = kmalloc(sizeof(*k), GFP_KERNEL); - if (!k) - return -ENOMEM; - atomic_set(&k->counter, 0); - -The setting is atomic in that the return values of the atomic operations by -all threads are guaranteed to be correct reflecting either the value that has -been set with this operation or set with another operation. A proper implicit -or explicit memory barrier is needed before the value set with the operation -is guaranteed to be readable with atomic_read from another thread. - -Next, we have: - - #define atomic_read(v) ((v)->counter) - -which simply reads the counter value currently visible to the calling thread. -The read is atomic in that the return value is guaranteed to be one of the -values initialized or modified with the interface operations if a proper -implicit or explicit memory barrier is used after possible runtime -initialization by any other thread and the value is modified only with the -interface operations. atomic_read does not guarantee that the runtime -initialization by any other thread is visible yet, so the user of the -interface must take care of that with a proper implicit or explicit memory -barrier. - -*** WARNING: atomic_read() and atomic_set() DO NOT IMPLY BARRIERS! *** - -Some architectures may choose to use the volatile keyword, barriers, or inline -assembly to guarantee some degree of immediacy for atomic_read() and -atomic_set(). This is not uniformly guaranteed, and may change in the future, -so all users of atomic_t should treat atomic_read() and atomic_set() as simple -C statements that may be reordered or optimized away entirely by the compiler -or processor, and explicitly invoke the appropriate compiler and/or memory -barrier for each use case. Failure to do so will result in code that may -suddenly break when used with different architectures or compiler -optimizations, or even changes in unrelated code which changes how the -compiler optimizes the section accessing atomic_t variables. - -*** YOU HAVE BEEN WARNED! *** - -Properly aligned pointers, longs, ints, and chars (and unsigned -equivalents) may be atomically loaded from and stored to in the same -sense as described for atomic_read() and atomic_set(). The READ_ONCE() -and WRITE_ONCE() macros should be used to prevent the compiler from using -optimizations that might otherwise optimize accesses out of existence on -the one hand, or that might create unsolicited accesses on the other. - -For example consider the following code: - - while (a > 0) - do_something(); - -If the compiler can prove that do_something() does not store to the -variable a, then the compiler is within its rights transforming this to -the following: - - tmp = a; - if (a > 0) - for (;;) - do_something(); - -If you don't want the compiler to do this (and you probably don't), then -you should use something like the following: - - while (READ_ONCE(a) < 0) - do_something(); - -Alternatively, you could place a barrier() call in the loop. - -For another example, consider the following code: - - tmp_a = a; - do_something_with(tmp_a); - do_something_else_with(tmp_a); - -If the compiler can prove that do_something_with() does not store to the -variable a, then the compiler is within its rights to manufacture an -additional load as follows: - - tmp_a = a; - do_something_with(tmp_a); - tmp_a = a; - do_something_else_with(tmp_a); - -This could fatally confuse your code if it expected the same value -to be passed to do_something_with() and do_something_else_with(). - -The compiler would be likely to manufacture this additional load if -do_something_with() was an inline function that made very heavy use -of registers: reloading from variable a could save a flush to the -stack and later reload. To prevent the compiler from attacking your -code in this manner, write the following: - - tmp_a = READ_ONCE(a); - do_something_with(tmp_a); - do_something_else_with(tmp_a); - -For a final example, consider the following code, assuming that the -variable a is set at boot time before the second CPU is brought online -and never changed later, so that memory barriers are not needed: - - if (a) - b = 9; - else - b = 42; - -The compiler is within its rights to manufacture an additional store -by transforming the above code into the following: - - b = 42; - if (a) - b = 9; - -This could come as a fatal surprise to other code running concurrently -that expected b to never have the value 42 if a was zero. To prevent -the compiler from doing this, write something like: - - if (a) - WRITE_ONCE(b, 9); - else - WRITE_ONCE(b, 42); - -Don't even -think- about doing this without proper use of memory barriers, -locks, or atomic operations if variable a can change at runtime! - -*** WARNING: READ_ONCE() OR WRITE_ONCE() DO NOT IMPLY A BARRIER! *** - -Now, we move onto the atomic operation interfaces typically implemented with -the help of assembly code. - - void atomic_add(int i, atomic_t *v); - void atomic_sub(int i, atomic_t *v); - void atomic_inc(atomic_t *v); - void atomic_dec(atomic_t *v); - -These four routines add and subtract integral values to/from the given -atomic_t value. The first two routines pass explicit integers by -which to make the adjustment, whereas the latter two use an implicit -adjustment value of "1". - -One very important aspect of these two routines is that they DO NOT -require any explicit memory barriers. They need only perform the -atomic_t counter update in an SMP safe manner. - -Next, we have: - - int atomic_inc_return(atomic_t *v); - int atomic_dec_return(atomic_t *v); - -These routines add 1 and subtract 1, respectively, from the given -atomic_t and return the new counter value after the operation is -performed. - -Unlike the above routines, it is required that these primitives -include explicit memory barriers that are performed before and after -the operation. It must be done such that all memory operations before -and after the atomic operation calls are strongly ordered with respect -to the atomic operation itself. - -For example, it should behave as if a smp_mb() call existed both -before and after the atomic operation. - -If the atomic instructions used in an implementation provide explicit -memory barrier semantics which satisfy the above requirements, that is -fine as well. - -Let's move on: - - int atomic_add_return(int i, atomic_t *v); - int atomic_sub_return(int i, atomic_t *v); - -These behave just like atomic_{inc,dec}_return() except that an -explicit counter adjustment is given instead of the implicit "1". -This means that like atomic_{inc,dec}_return(), the memory barrier -semantics are required. - -Next: - - int atomic_inc_and_test(atomic_t *v); - int atomic_dec_and_test(atomic_t *v); - -These two routines increment and decrement by 1, respectively, the -given atomic counter. They return a boolean indicating whether the -resulting counter value was zero or not. - -Again, these primitives provide explicit memory barrier semantics around -the atomic operation. - - int atomic_sub_and_test(int i, atomic_t *v); - -This is identical to atomic_dec_and_test() except that an explicit -decrement is given instead of the implicit "1". This primitive must -provide explicit memory barrier semantics around the operation. - - int atomic_add_negative(int i, atomic_t *v); - -The given increment is added to the given atomic counter value. A boolean -is return which indicates whether the resulting counter value is negative. -This primitive must provide explicit memory barrier semantics around -the operation. - -Then: - - int atomic_xchg(atomic_t *v, int new); - -This performs an atomic exchange operation on the atomic variable v, setting -the given new value. It returns the old value that the atomic variable v had -just before the operation. - -atomic_xchg must provide explicit memory barriers around the operation. - - int atomic_cmpxchg(atomic_t *v, int old, int new); - -This performs an atomic compare exchange operation on the atomic value v, -with the given old and new values. Like all atomic_xxx operations, -atomic_cmpxchg will only satisfy its atomicity semantics as long as all -other accesses of *v are performed through atomic_xxx operations. - -atomic_cmpxchg must provide explicit memory barriers around the operation, -although if the comparison fails then no memory ordering guarantees are -required. - -The semantics for atomic_cmpxchg are the same as those defined for 'cas' -below. - -Finally: - - int atomic_add_unless(atomic_t *v, int a, int u); - -If the atomic value v is not equal to u, this function adds a to v, and -returns non zero. If v is equal to u then it returns zero. This is done as -an atomic operation. - -atomic_add_unless must provide explicit memory barriers around the -operation unless it fails (returns 0). - -atomic_inc_not_zero, equivalent to atomic_add_unless(v, 1, 0) - - -If a caller requires memory barrier semantics around an atomic_t -operation which does not return a value, a set of interfaces are -defined which accomplish this: - - void smp_mb__before_atomic(void); - void smp_mb__after_atomic(void); - -For example, smp_mb__before_atomic() can be used like so: - - obj->dead = 1; - smp_mb__before_atomic(); - atomic_dec(&obj->ref_count); - -It makes sure that all memory operations preceding the atomic_dec() -call are strongly ordered with respect to the atomic counter -operation. In the above example, it guarantees that the assignment of -"1" to obj->dead will be globally visible to other cpus before the -atomic counter decrement. - -Without the explicit smp_mb__before_atomic() call, the -implementation could legally allow the atomic counter update visible -to other cpus before the "obj->dead = 1;" assignment. - -A missing memory barrier in the cases where they are required by the -atomic_t implementation above can have disastrous results. Here is -an example, which follows a pattern occurring frequently in the Linux -kernel. It is the use of atomic counters to implement reference -counting, and it works such that once the counter falls to zero it can -be guaranteed that no other entity can be accessing the object: - -static void obj_list_add(struct obj *obj, struct list_head *head) -{ - obj->active = 1; - list_add(&obj->list, head); -} - -static void obj_list_del(struct obj *obj) -{ - list_del(&obj->list); - obj->active = 0; -} - -static void obj_destroy(struct obj *obj) -{ - BUG_ON(obj->active); - kfree(obj); -} - -struct obj *obj_list_peek(struct list_head *head) -{ - if (!list_empty(head)) { - struct obj *obj; - - obj = list_entry(head->next, struct obj, list); - atomic_inc(&obj->refcnt); - return obj; - } - return NULL; -} - -void obj_poke(void) -{ - struct obj *obj; - - spin_lock(&global_list_lock); - obj = obj_list_peek(&global_list); - spin_unlock(&global_list_lock); - - if (obj) { - obj->ops->poke(obj); - if (atomic_dec_and_test(&obj->refcnt)) - obj_destroy(obj); - } -} - -void obj_timeout(struct obj *obj) -{ - spin_lock(&global_list_lock); - obj_list_del(obj); - spin_unlock(&global_list_lock); - - if (atomic_dec_and_test(&obj->refcnt)) - obj_destroy(obj); -} - -(This is a simplification of the ARP queue management in the - generic neighbour discover code of the networking. Olaf Kirch - found a bug wrt. memory barriers in kfree_skb() that exposed - the atomic_t memory barrier requirements quite clearly.) - -Given the above scheme, it must be the case that the obj->active -update done by the obj list deletion be visible to other processors -before the atomic counter decrement is performed. - -Otherwise, the counter could fall to zero, yet obj->active would still -be set, thus triggering the assertion in obj_destroy(). The error -sequence looks like this: - - cpu 0 cpu 1 - obj_poke() obj_timeout() - obj = obj_list_peek(); - ... gains ref to obj, refcnt=2 - obj_list_del(obj); - obj->active = 0 ... - ... visibility delayed ... - atomic_dec_and_test() - ... refcnt drops to 1 ... - atomic_dec_and_test() - ... refcount drops to 0 ... - obj_destroy() - BUG() triggers since obj->active - still seen as one - obj->active update visibility occurs - -With the memory barrier semantics required of the atomic_t operations -which return values, the above sequence of memory visibility can never -happen. Specifically, in the above case the atomic_dec_and_test() -counter decrement would not become globally visible until the -obj->active update does. - -As a historical note, 32-bit Sparc used to only allow usage of -24-bits of its atomic_t type. This was because it used 8 bits -as a spinlock for SMP safety. Sparc32 lacked a "compare and swap" -type instruction. However, 32-bit Sparc has since been moved over -to a "hash table of spinlocks" scheme, that allows the full 32-bit -counter to be realized. Essentially, an array of spinlocks are -indexed into based upon the address of the atomic_t being operated -on, and that lock protects the atomic operation. Parisc uses the -same scheme. - -Another note is that the atomic_t operations returning values are -extremely slow on an old 386. - -We will now cover the atomic bitmask operations. You will find that -their SMP and memory barrier semantics are similar in shape and scope -to the atomic_t ops above. - -Native atomic bit operations are defined to operate on objects aligned -to the size of an "unsigned long" C data type, and are least of that -size. The endianness of the bits within each "unsigned long" are the -native endianness of the cpu. - - void set_bit(unsigned long nr, volatile unsigned long *addr); - void clear_bit(unsigned long nr, volatile unsigned long *addr); - void change_bit(unsigned long nr, volatile unsigned long *addr); - -These routines set, clear, and change, respectively, the bit number -indicated by "nr" on the bit mask pointed to by "ADDR". - -They must execute atomically, yet there are no implicit memory barrier -semantics required of these interfaces. - - int test_and_set_bit(unsigned long nr, volatile unsigned long *addr); - int test_and_clear_bit(unsigned long nr, volatile unsigned long *addr); - int test_and_change_bit(unsigned long nr, volatile unsigned long *addr); - -Like the above, except that these routines return a boolean which -indicates whether the changed bit was set _BEFORE_ the atomic bit -operation. - -WARNING! It is incredibly important that the value be a boolean, -ie. "0" or "1". Do not try to be fancy and save a few instructions by -declaring the above to return "long" and just returning something like -"old_val & mask" because that will not work. - -For one thing, this return value gets truncated to int in many code -paths using these interfaces, so on 64-bit if the bit is set in the -upper 32-bits then testers will never see that. - -One great example of where this problem crops up are the thread_info -flag operations. Routines such as test_and_set_ti_thread_flag() chop -the return value into an int. There are other places where things -like this occur as well. - -These routines, like the atomic_t counter operations returning values, -must provide explicit memory barrier semantics around their execution. -All memory operations before the atomic bit operation call must be -made visible globally before the atomic bit operation is made visible. -Likewise, the atomic bit operation must be visible globally before any -subsequent memory operation is made visible. For example: - - obj->dead = 1; - if (test_and_set_bit(0, &obj->flags)) - /* ... */; - obj->killed = 1; - -The implementation of test_and_set_bit() must guarantee that -"obj->dead = 1;" is visible to cpus before the atomic memory operation -done by test_and_set_bit() becomes visible. Likewise, the atomic -memory operation done by test_and_set_bit() must become visible before -"obj->killed = 1;" is visible. - -Finally there is the basic operation: - - int test_bit(unsigned long nr, __const__ volatile unsigned long *addr); - -Which returns a boolean indicating if bit "nr" is set in the bitmask -pointed to by "addr". - -If explicit memory barriers are required around {set,clear}_bit() (which do -not return a value, and thus does not need to provide memory barrier -semantics), two interfaces are provided: - - void smp_mb__before_atomic(void); - void smp_mb__after_atomic(void); - -They are used as follows, and are akin to their atomic_t operation -brothers: - - /* All memory operations before this call will - * be globally visible before the clear_bit(). - */ - smp_mb__before_atomic(); - clear_bit( ... ); - - /* The clear_bit() will be visible before all - * subsequent memory operations. - */ - smp_mb__after_atomic(); - -There are two special bitops with lock barrier semantics (acquire/release, -same as spinlocks). These operate in the same way as their non-_lock/unlock -postfixed variants, except that they are to provide acquire/release semantics, -respectively. This means they can be used for bit_spin_trylock and -bit_spin_unlock type operations without specifying any more barriers. - - int test_and_set_bit_lock(unsigned long nr, unsigned long *addr); - void clear_bit_unlock(unsigned long nr, unsigned long *addr); - void __clear_bit_unlock(unsigned long nr, unsigned long *addr); - -The __clear_bit_unlock version is non-atomic, however it still implements -unlock barrier semantics. This can be useful if the lock itself is protecting -the other bits in the word. - -Finally, there are non-atomic versions of the bitmask operations -provided. They are used in contexts where some other higher-level SMP -locking scheme is being used to protect the bitmask, and thus less -expensive non-atomic operations may be used in the implementation. -They have names similar to the above bitmask operation interfaces, -except that two underscores are prefixed to the interface name. - - void __set_bit(unsigned long nr, volatile unsigned long *addr); - void __clear_bit(unsigned long nr, volatile unsigned long *addr); - void __change_bit(unsigned long nr, volatile unsigned long *addr); - int __test_and_set_bit(unsigned long nr, volatile unsigned long *addr); - int __test_and_clear_bit(unsigned long nr, volatile unsigned long *addr); - int __test_and_change_bit(unsigned long nr, volatile unsigned long *addr); - -These non-atomic variants also do not require any special memory -barrier semantics. - -The routines xchg() and cmpxchg() must provide the same exact -memory-barrier semantics as the atomic and bit operations returning -values. - -Note: If someone wants to use xchg(), cmpxchg() and their variants, -linux/atomic.h should be included rather than asm/cmpxchg.h, unless -the code is in arch/* and can take care of itself. - -Spinlocks and rwlocks have memory barrier expectations as well. -The rule to follow is simple: - -1) When acquiring a lock, the implementation must make it globally - visible before any subsequent memory operation. - -2) When releasing a lock, the implementation must make it such that - all previous memory operations are globally visible before the - lock release. - -Which finally brings us to _atomic_dec_and_lock(). There is an -architecture-neutral version implemented in lib/dec_and_lock.c, -but most platforms will wish to optimize this in assembler. - - int _atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock); - -Atomically decrement the given counter, and if will drop to zero -atomically acquire the given spinlock and perform the decrement -of the counter to zero. If it does not drop to zero, do nothing -with the spinlock. - -It is actually pretty simple to get the memory barrier correct. -Simply satisfy the spinlock grab requirements, which is make -sure the spinlock operation is globally visible before any -subsequent memory operation. - -We can demonstrate this operation more clearly if we define -an abstract atomic operation: - - long cas(long *mem, long old, long new); - -"cas" stands for "compare and swap". It atomically: - -1) Compares "old" with the value currently at "mem". -2) If they are equal, "new" is written to "mem". -3) Regardless, the current value at "mem" is returned. - -As an example usage, here is what an atomic counter update -might look like: - -void example_atomic_inc(long *counter) -{ - long old, new, ret; - - while (1) { - old = *counter; - new = old + 1; - - ret = cas(counter, old, new); - if (ret == old) - break; - } -} - -Let's use cas() in order to build a pseudo-C atomic_dec_and_lock(): - -int _atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock) -{ - long old, new, ret; - int went_to_zero; - - went_to_zero = 0; - while (1) { - old = atomic_read(atomic); - new = old - 1; - if (new == 0) { - went_to_zero = 1; - spin_lock(lock); - } - ret = cas(atomic, old, new); - if (ret == old) - break; - if (went_to_zero) { - spin_unlock(lock); - went_to_zero = 0; - } - } - - return went_to_zero; -} - -Now, as far as memory barriers go, as long as spin_lock() -strictly orders all subsequent memory operations (including -the cas()) with respect to itself, things will be fine. - -Said another way, _atomic_dec_and_lock() must guarantee that -a counter dropping to zero is never made visible before the -spinlock being acquired. - -Note that this also means that for the case where the counter -is not dropping to zero, there are no memory ordering -requirements. diff --git a/Documentation/core-api/atomic_ops.rst b/Documentation/core-api/atomic_ops.rst new file mode 100644 index 000000000000..55e43f1c80de --- /dev/null +++ b/Documentation/core-api/atomic_ops.rst @@ -0,0 +1,658 @@ +======================================================= +Semantics and Behavior of Atomic and Bitmask Operations +======================================================= + +:Author: David S. Miller + +This document is intended to serve as a guide to Linux port +maintainers on how to implement atomic counter, bitops, and spinlock +interfaces properly. + +Atomic Type And Operations +========================== + +The atomic_t type should be defined as a signed integer and +the atomic_long_t type as a signed long integer. Also, they should +be made opaque such that any kind of cast to a normal C integer type +will fail. Something like the following should suffice:: + + typedef struct { int counter; } atomic_t; + typedef struct { long counter; } atomic_long_t; + +Historically, counter has been declared volatile. This is now discouraged. +See :ref:`Documentation/process/volatile-considered-harmful.rst +` for the complete rationale. + +local_t is very similar to atomic_t. If the counter is per CPU and only +updated by one CPU, local_t is probably more appropriate. Please see +:ref:`Documentation/core-api/local_ops.rst ` for the semantics of +local_t. + +The first operations to implement for atomic_t's are the initializers and +plain reads. :: + + #define ATOMIC_INIT(i) { (i) } + #define atomic_set(v, i) ((v)->counter = (i)) + +The first macro is used in definitions, such as:: + + static atomic_t my_counter = ATOMIC_INIT(1); + +The initializer is atomic in that the return values of the atomic operations +are guaranteed to be correct reflecting the initialized value if the +initializer is used before runtime. If the initializer is used at runtime, a +proper implicit or explicit read memory barrier is needed before reading the +value with atomic_read from another thread. + +As with all of the ``atomic_`` interfaces, replace the leading ``atomic_`` +with ``atomic_long_`` to operate on atomic_long_t. + +The second interface can be used at runtime, as in:: + + struct foo { atomic_t counter; }; + ... + + struct foo *k; + + k = kmalloc(sizeof(*k), GFP_KERNEL); + if (!k) + return -ENOMEM; + atomic_set(&k->counter, 0); + +The setting is atomic in that the return values of the atomic operations by +all threads are guaranteed to be correct reflecting either the value that has +been set with this operation or set with another operation. A proper implicit +or explicit memory barrier is needed before the value set with the operation +is guaranteed to be readable with atomic_read from another thread. + +Next, we have:: + + #define atomic_read(v) ((v)->counter) + +which simply reads the counter value currently visible to the calling thread. +The read is atomic in that the return value is guaranteed to be one of the +values initialized or modified with the interface operations if a proper +implicit or explicit memory barrier is used after possible runtime +initialization by any other thread and the value is modified only with the +interface operations. atomic_read does not guarantee that the runtime +initialization by any other thread is visible yet, so the user of the +interface must take care of that with a proper implicit or explicit memory +barrier. + +.. warning:: + + ``atomic_read()`` and ``atomic_set()`` DO NOT IMPLY BARRIERS! + + Some architectures may choose to use the volatile keyword, barriers, or + inline assembly to guarantee some degree of immediacy for atomic_read() + and atomic_set(). This is not uniformly guaranteed, and may change in + the future, so all users of atomic_t should treat atomic_read() and + atomic_set() as simple C statements that may be reordered or optimized + away entirely by the compiler or processor, and explicitly invoke the + appropriate compiler and/or memory barrier for each use case. Failure + to do so will result in code that may suddenly break when used with + different architectures or compiler optimizations, or even changes in + unrelated code which changes how the compiler optimizes the section + accessing atomic_t variables. + +Properly aligned pointers, longs, ints, and chars (and unsigned +equivalents) may be atomically loaded from and stored to in the same +sense as described for atomic_read() and atomic_set(). The READ_ONCE() +and WRITE_ONCE() macros should be used to prevent the compiler from using +optimizations that might otherwise optimize accesses out of existence on +the one hand, or that might create unsolicited accesses on the other. + +For example consider the following code:: + + while (a > 0) + do_something(); + +If the compiler can prove that do_something() does not store to the +variable a, then the compiler is within its rights transforming this to +the following:: + + tmp = a; + if (a > 0) + for (;;) + do_something(); + +If you don't want the compiler to do this (and you probably don't), then +you should use something like the following:: + + while (READ_ONCE(a) < 0) + do_something(); + +Alternatively, you could place a barrier() call in the loop. + +For another example, consider the following code:: + + tmp_a = a; + do_something_with(tmp_a); + do_something_else_with(tmp_a); + +If the compiler can prove that do_something_with() does not store to the +variable a, then the compiler is within its rights to manufacture an +additional load as follows:: + + tmp_a = a; + do_something_with(tmp_a); + tmp_a = a; + do_something_else_with(tmp_a); + +This could fatally confuse your code if it expected the same value +to be passed to do_something_with() and do_something_else_with(). + +The compiler would be likely to manufacture this additional load if +do_something_with() was an inline function that made very heavy use +of registers: reloading from variable a could save a flush to the +stack and later reload. To prevent the compiler from attacking your +code in this manner, write the following:: + + tmp_a = READ_ONCE(a); + do_something_with(tmp_a); + do_something_else_with(tmp_a); + +For a final example, consider the following code, assuming that the +variable a is set at boot time before the second CPU is brought online +and never changed later, so that memory barriers are not needed:: + + if (a) + b = 9; + else + b = 42; + +The compiler is within its rights to manufacture an additional store +by transforming the above code into the following:: + + b = 42; + if (a) + b = 9; + +This could come as a fatal surprise to other code running concurrently +that expected b to never have the value 42 if a was zero. To prevent +the compiler from doing this, write something like:: + + if (a) + WRITE_ONCE(b, 9); + else + WRITE_ONCE(b, 42); + +Don't even -think- about doing this without proper use of memory barriers, +locks, or atomic operations if variable a can change at runtime! + +.. warning:: + + ``READ_ONCE()`` OR ``WRITE_ONCE()`` DO NOT IMPLY A BARRIER! + +Now, we move onto the atomic operation interfaces typically implemented with +the help of assembly code. :: + + void atomic_add(int i, atomic_t *v); + void atomic_sub(int i, atomic_t *v); + void atomic_inc(atomic_t *v); + void atomic_dec(atomic_t *v); + +These four routines add and subtract integral values to/from the given +atomic_t value. The first two routines pass explicit integers by +which to make the adjustment, whereas the latter two use an implicit +adjustment value of "1". + +One very important aspect of these two routines is that they DO NOT +require any explicit memory barriers. They need only perform the +atomic_t counter update in an SMP safe manner. + +Next, we have:: + + int atomic_inc_return(atomic_t *v); + int atomic_dec_return(atomic_t *v); + +These routines add 1 and subtract 1, respectively, from the given +atomic_t and return the new counter value after the operation is +performed. + +Unlike the above routines, it is required that these primitives +include explicit memory barriers that are performed before and after +the operation. It must be done such that all memory operations before +and after the atomic operation calls are strongly ordered with respect +to the atomic operation itself. + +For example, it should behave as if a smp_mb() call existed both +before and after the atomic operation. + +If the atomic instructions used in an implementation provide explicit +memory barrier semantics which satisfy the above requirements, that is +fine as well. + +Let's move on:: + + int atomic_add_return(int i, atomic_t *v); + int atomic_sub_return(int i, atomic_t *v); + +These behave just like atomic_{inc,dec}_return() except that an +explicit counter adjustment is given instead of the implicit "1". +This means that like atomic_{inc,dec}_return(), the memory barrier +semantics are required. + +Next:: + + int atomic_inc_and_test(atomic_t *v); + int atomic_dec_and_test(atomic_t *v); + +These two routines increment and decrement by 1, respectively, the +given atomic counter. They return a boolean indicating whether the +resulting counter value was zero or not. + +Again, these primitives provide explicit memory barrier semantics around +the atomic operation:: + + int atomic_sub_and_test(int i, atomic_t *v); + +This is identical to atomic_dec_and_test() except that an explicit +decrement is given instead of the implicit "1". This primitive must +provide explicit memory barrier semantics around the operation:: + + int atomic_add_negative(int i, atomic_t *v); + +The given increment is added to the given atomic counter value. A boolean +is return which indicates whether the resulting counter value is negative. +This primitive must provide explicit memory barrier semantics around +the operation. + +Then:: + + int atomic_xchg(atomic_t *v, int new); + +This performs an atomic exchange operation on the atomic variable v, setting +the given new value. It returns the old value that the atomic variable v had +just before the operation. + +atomic_xchg must provide explicit memory barriers around the operation. :: + + int atomic_cmpxchg(atomic_t *v, int old, int new); + +This performs an atomic compare exchange operation on the atomic value v, +with the given old and new values. Like all atomic_xxx operations, +atomic_cmpxchg will only satisfy its atomicity semantics as long as all +other accesses of \*v are performed through atomic_xxx operations. + +atomic_cmpxchg must provide explicit memory barriers around the operation, +although if the comparison fails then no memory ordering guarantees are +required. + +The semantics for atomic_cmpxchg are the same as those defined for 'cas' +below. + +Finally:: + + int atomic_add_unless(atomic_t *v, int a, int u); + +If the atomic value v is not equal to u, this function adds a to v, and +returns non zero. If v is equal to u then it returns zero. This is done as +an atomic operation. + +atomic_add_unless must provide explicit memory barriers around the +operation unless it fails (returns 0). + +atomic_inc_not_zero, equivalent to atomic_add_unless(v, 1, 0) + + +If a caller requires memory barrier semantics around an atomic_t +operation which does not return a value, a set of interfaces are +defined which accomplish this:: + + void smp_mb__before_atomic(void); + void smp_mb__after_atomic(void); + +For example, smp_mb__before_atomic() can be used like so:: + + obj->dead = 1; + smp_mb__before_atomic(); + atomic_dec(&obj->ref_count); + +It makes sure that all memory operations preceding the atomic_dec() +call are strongly ordered with respect to the atomic counter +operation. In the above example, it guarantees that the assignment of +"1" to obj->dead will be globally visible to other cpus before the +atomic counter decrement. + +Without the explicit smp_mb__before_atomic() call, the +implementation could legally allow the atomic counter update visible +to other cpus before the "obj->dead = 1;" assignment. + +A missing memory barrier in the cases where they are required by the +atomic_t implementation above can have disastrous results. Here is +an example, which follows a pattern occurring frequently in the Linux +kernel. It is the use of atomic counters to implement reference +counting, and it works such that once the counter falls to zero it can +be guaranteed that no other entity can be accessing the object:: + + static void obj_list_add(struct obj *obj, struct list_head *head) + { + obj->active = 1; + list_add(&obj->list, head); + } + + static void obj_list_del(struct obj *obj) + { + list_del(&obj->list); + obj->active = 0; + } + + static void obj_destroy(struct obj *obj) + { + BUG_ON(obj->active); + kfree(obj); + } + + struct obj *obj_list_peek(struct list_head *head) + { + if (!list_empty(head)) { + struct obj *obj; + + obj = list_entry(head->next, struct obj, list); + atomic_inc(&obj->refcnt); + return obj; + } + return NULL; + } + + void obj_poke(void) + { + struct obj *obj; + + spin_lock(&global_list_lock); + obj = obj_list_peek(&global_list); + spin_unlock(&global_list_lock); + + if (obj) { + obj->ops->poke(obj); + if (atomic_dec_and_test(&obj->refcnt)) + obj_destroy(obj); + } + } + + void obj_timeout(struct obj *obj) + { + spin_lock(&global_list_lock); + obj_list_del(obj); + spin_unlock(&global_list_lock); + + if (atomic_dec_and_test(&obj->refcnt)) + obj_destroy(obj); + } + +.. note:: + + This is a simplification of the ARP queue management in the generic + neighbour discover code of the networking. Olaf Kirch found a bug wrt. + memory barriers in kfree_skb() that exposed the atomic_t memory barrier + requirements quite clearly. + +Given the above scheme, it must be the case that the obj->active +update done by the obj list deletion be visible to other processors +before the atomic counter decrement is performed. + +Otherwise, the counter could fall to zero, yet obj->active would still +be set, thus triggering the assertion in obj_destroy(). The error +sequence looks like this:: + + cpu 0 cpu 1 + obj_poke() obj_timeout() + obj = obj_list_peek(); + ... gains ref to obj, refcnt=2 + obj_list_del(obj); + obj->active = 0 ... + ... visibility delayed ... + atomic_dec_and_test() + ... refcnt drops to 1 ... + atomic_dec_and_test() + ... refcount drops to 0 ... + obj_destroy() + BUG() triggers since obj->active + still seen as one + obj->active update visibility occurs + +With the memory barrier semantics required of the atomic_t operations +which return values, the above sequence of memory visibility can never +happen. Specifically, in the above case the atomic_dec_and_test() +counter decrement would not become globally visible until the +obj->active update does. + +As a historical note, 32-bit Sparc used to only allow usage of +24-bits of its atomic_t type. This was because it used 8 bits +as a spinlock for SMP safety. Sparc32 lacked a "compare and swap" +type instruction. However, 32-bit Sparc has since been moved over +to a "hash table of spinlocks" scheme, that allows the full 32-bit +counter to be realized. Essentially, an array of spinlocks are +indexed into based upon the address of the atomic_t being operated +on, and that lock protects the atomic operation. Parisc uses the +same scheme. + +Another note is that the atomic_t operations returning values are +extremely slow on an old 386. + + +Atomic Bitmask +============== + +We will now cover the atomic bitmask operations. You will find that +their SMP and memory barrier semantics are similar in shape and scope +to the atomic_t ops above. + +Native atomic bit operations are defined to operate on objects aligned +to the size of an "unsigned long" C data type, and are least of that +size. The endianness of the bits within each "unsigned long" are the +native endianness of the cpu. :: + + void set_bit(unsigned long nr, volatile unsigned long *addr); + void clear_bit(unsigned long nr, volatile unsigned long *addr); + void change_bit(unsigned long nr, volatile unsigned long *addr); + +These routines set, clear, and change, respectively, the bit number +indicated by "nr" on the bit mask pointed to by "ADDR". + +They must execute atomically, yet there are no implicit memory barrier +semantics required of these interfaces. :: + + int test_and_set_bit(unsigned long nr, volatile unsigned long *addr); + int test_and_clear_bit(unsigned long nr, volatile unsigned long *addr); + int test_and_change_bit(unsigned long nr, volatile unsigned long *addr); + +Like the above, except that these routines return a boolean which +indicates whether the changed bit was set _BEFORE_ the atomic bit +operation. + +WARNING! It is incredibly important that the value be a boolean, +ie. "0" or "1". Do not try to be fancy and save a few instructions by +declaring the above to return "long" and just returning something like +"old_val & mask" because that will not work. + +For one thing, this return value gets truncated to int in many code +paths using these interfaces, so on 64-bit if the bit is set in the +upper 32-bits then testers will never see that. + +One great example of where this problem crops up are the thread_info +flag operations. Routines such as test_and_set_ti_thread_flag() chop +the return value into an int. There are other places where things +like this occur as well. + +These routines, like the atomic_t counter operations returning values, +must provide explicit memory barrier semantics around their execution. +All memory operations before the atomic bit operation call must be +made visible globally before the atomic bit operation is made visible. +Likewise, the atomic bit operation must be visible globally before any +subsequent memory operation is made visible. For example:: + + obj->dead = 1; + if (test_and_set_bit(0, &obj->flags)) + /* ... */; + obj->killed = 1; + +The implementation of test_and_set_bit() must guarantee that +"obj->dead = 1;" is visible to cpus before the atomic memory operation +done by test_and_set_bit() becomes visible. Likewise, the atomic +memory operation done by test_and_set_bit() must become visible before +"obj->killed = 1;" is visible. + +Finally there is the basic operation:: + + int test_bit(unsigned long nr, __const__ volatile unsigned long *addr); + +Which returns a boolean indicating if bit "nr" is set in the bitmask +pointed to by "addr". + +If explicit memory barriers are required around {set,clear}_bit() (which do +not return a value, and thus does not need to provide memory barrier +semantics), two interfaces are provided:: + + void smp_mb__before_atomic(void); + void smp_mb__after_atomic(void); + +They are used as follows, and are akin to their atomic_t operation +brothers:: + + /* All memory operations before this call will + * be globally visible before the clear_bit(). + */ + smp_mb__before_atomic(); + clear_bit( ... ); + + /* The clear_bit() will be visible before all + * subsequent memory operations. + */ + smp_mb__after_atomic(); + +There are two special bitops with lock barrier semantics (acquire/release, +same as spinlocks). These operate in the same way as their non-_lock/unlock +postfixed variants, except that they are to provide acquire/release semantics, +respectively. This means they can be used for bit_spin_trylock and +bit_spin_unlock type operations without specifying any more barriers. :: + + int test_and_set_bit_lock(unsigned long nr, unsigned long *addr); + void clear_bit_unlock(unsigned long nr, unsigned long *addr); + void __clear_bit_unlock(unsigned long nr, unsigned long *addr); + +The __clear_bit_unlock version is non-atomic, however it still implements +unlock barrier semantics. This can be useful if the lock itself is protecting +the other bits in the word. + +Finally, there are non-atomic versions of the bitmask operations +provided. They are used in contexts where some other higher-level SMP +locking scheme is being used to protect the bitmask, and thus less +expensive non-atomic operations may be used in the implementation. +They have names similar to the above bitmask operation interfaces, +except that two underscores are prefixed to the interface name. :: + + void __set_bit(unsigned long nr, volatile unsigned long *addr); + void __clear_bit(unsigned long nr, volatile unsigned long *addr); + void __change_bit(unsigned long nr, volatile unsigned long *addr); + int __test_and_set_bit(unsigned long nr, volatile unsigned long *addr); + int __test_and_clear_bit(unsigned long nr, volatile unsigned long *addr); + int __test_and_change_bit(unsigned long nr, volatile unsigned long *addr); + +These non-atomic variants also do not require any special memory +barrier semantics. + +The routines xchg() and cmpxchg() must provide the same exact +memory-barrier semantics as the atomic and bit operations returning +values. + +.. note:: + + If someone wants to use xchg(), cmpxchg() and their variants, + linux/atomic.h should be included rather than asm/cmpxchg.h, unless the + code is in arch/* and can take care of itself. + +Spinlocks and rwlocks have memory barrier expectations as well. +The rule to follow is simple: + +1) When acquiring a lock, the implementation must make it globally + visible before any subsequent memory operation. + +2) When releasing a lock, the implementation must make it such that + all previous memory operations are globally visible before the + lock release. + +Which finally brings us to _atomic_dec_and_lock(). There is an +architecture-neutral version implemented in lib/dec_and_lock.c, +but most platforms will wish to optimize this in assembler. :: + + int _atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock); + +Atomically decrement the given counter, and if will drop to zero +atomically acquire the given spinlock and perform the decrement +of the counter to zero. If it does not drop to zero, do nothing +with the spinlock. + +It is actually pretty simple to get the memory barrier correct. +Simply satisfy the spinlock grab requirements, which is make +sure the spinlock operation is globally visible before any +subsequent memory operation. + +We can demonstrate this operation more clearly if we define +an abstract atomic operation:: + + long cas(long *mem, long old, long new); + +"cas" stands for "compare and swap". It atomically: + +1) Compares "old" with the value currently at "mem". +2) If they are equal, "new" is written to "mem". +3) Regardless, the current value at "mem" is returned. + +As an example usage, here is what an atomic counter update +might look like:: + + void example_atomic_inc(long *counter) + { + long old, new, ret; + + while (1) { + old = *counter; + new = old + 1; + + ret = cas(counter, old, new); + if (ret == old) + break; + } + } + +Let's use cas() in order to build a pseudo-C atomic_dec_and_lock():: + + int _atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock) + { + long old, new, ret; + int went_to_zero; + + went_to_zero = 0; + while (1) { + old = atomic_read(atomic); + new = old - 1; + if (new == 0) { + went_to_zero = 1; + spin_lock(lock); + } + ret = cas(atomic, old, new); + if (ret == old) + break; + if (went_to_zero) { + spin_unlock(lock); + went_to_zero = 0; + } + } + + return went_to_zero; + } + +Now, as far as memory barriers go, as long as spin_lock() +strictly orders all subsequent memory operations (including +the cas()) with respect to itself, things will be fine. + +Said another way, _atomic_dec_and_lock() must guarantee that +a counter dropping to zero is never made visible before the +spinlock being acquired. + +.. note:: + + Note that this also means that for the case where the counter is not + dropping to zero, there are no memory ordering requirements. diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst index f53555ee4931..25b4e4a4d017 100644 --- a/Documentation/core-api/index.rst +++ b/Documentation/core-api/index.rst @@ -8,6 +8,7 @@ Kernel and driver related documentation. :maxdepth: 1 assoc_array + atomic_ops local_ops workqueue diff --git a/Documentation/process/volatile-considered-harmful.rst b/Documentation/process/volatile-considered-harmful.rst index e0d042af386c..4934e656a6f3 100644 --- a/Documentation/process/volatile-considered-harmful.rst +++ b/Documentation/process/volatile-considered-harmful.rst @@ -1,3 +1,6 @@ + +.. _volatile_considered_harmful: + Why the "volatile" type class should not be used ------------------------------------------------ -- cgit v1.2.3