Mardi, 17. Avril 2012
Dr.Dobb's articles by Herb Sutter :
- Free lunch is over by Herb Sutter
- Design for Manycore Systems by Herb Sutter
- Lock-Free Queues by Petru Marginean
- Lock-Free Code: A False Sense of Security by Herb Sutter
- Writing Lock-Free Code: A Corrected Queue by Herb Sutter
- Writing a Generalized Concurrent Queue by Herb Sutter
- Measuring Parallel Performance: Optimizing a Concurrent Queue by Herb Sutter
- Understanding Parallel Performance by Herb Sutter
Dr.Dobb's articles by Andrei Alexandrescu :
- Lock-Free Data Structures by Andrei Alexandrescu
- Lock-Free Data Structures with Hazard Pointers by Andrei Alexandrescu
Articles by Bartosz Milewski :
- Multicores and Publication Safety
- Who ordered memory fences on an x86?
- Who ordered sequential consistency?
- C++ atomics and memory ordering
- The inscrutable c++ memory model
Theoretical papers :
- CAS-Based Lock-Free Algorithm for Shared Deques by Maged M. Michael
- Obstruction-Free Synchronization: Double-Ended Queues as an Example by Maurice Herlihy, Victor Luchangco and Mark Moir
- Lock-free Dynamically Resizable Arrays
- Lock-Free and Practical Deques and Doubly Linked Lists using Single-Word Compare-And-Swap1
- An Optimistic Approach to Lock-Free FIFO Queues
- Fast and Lock-Free Concurrent Priority Queues for Multi-Thread Systems
- Split-Ordered Lists: Lock-Free Extensible Hash Tables
- High Performance Dynamic Lock-Free Hash Tables and List-Based Sets
- Practical lock-freedom
- Real-Time Computing with Lock-Free Shared Objects
- Scalable lock-free dynamic memory allocation
- simple, fast, and practical non blocking and blocking concurrent queue algorithms
- Hazard pointers
- Safe memory reclamation for dynamic lock-free objects using atomic reads and writes
- The repeat offender problem (ROP)
- Using Elimination to Implement Scalable and Lock-Free FIFO Queues
- J. D. Valois. Implementing Lock-Free Queues.
- Correction of a memory management method for lock-free data structures
- Allocating memory in a lock-free manner
- A Scalable Lock-free Stack Algorithm
- A Practical Multi-Word Compare-and-Swap Operation
- Scalable Queue-Based Spin Locks with Timeout
- Mostly Lock-Free Malloc
- Wait free SPSC bounded and unbounded queue with no barrier
others :
- Gcc atomic wiki page
- Lock free containers
- Atomic counter whitepaper
- Futexes
- Mutex and memory visibility
- Memory Barriers: a Hardware View for Software Hackers
- Introduction to lock-free/wait-free and the ABA problem here and here
- non blocking multi producers single consumer queue
- multi producer/multi consumer
- What every programmer should know about memory
- Understanding the linux kernel
- Articles at locklessinc
- Articles at 1024 cores
- Blog at 1024 cores
- Memory Ordering in Modern Microprocessors, Part I
- Memory Ordering in Modern Microprocessors, Part II
- What is RCU
- urcu
- My first RCU design
- Modern micro processors
- cbloom rants
- Flat combining
libraries :
- libsync
- atomic_ops
- gcc atomic builtins
- thread building blocks
- Atomic ptr plus project
- lib lock free data structure
- a scalable allocator : OSlash
- cpp framework
- lock free
- boost began impl of c++0x
- google perf tools
- boost impl of c++0x
- review of available atomic libs
- concurency kit
Mercredi, 21. Mars 2012
Zoe Lise Ewann part 2
Vignettes
Contenu de la gallerie
Grandes photos
Contenu de la gallerie
Voir les commentaires...Zoe-Lise-Ewann part 1/2
Vignettes
Contenu de la gallerie
Grandes photos
Contenu de la gallerie
Voir les commentaires...Photos en retard noel
vignettes
Contenu de la gallerie
grand format
Contenu de la gallerie
Voir les commentaires...Photos en retard noel
vignettes
Contenu de la gallerie
grand format
Contenu de la gallerie
Voir les commentaires...Photos en retard de l'anniv des petits
vignettes
Contenu de la gallerie
photos
Contenu de la gallerie
Voir les commentaires...Photos en retard d'halloween a l'ecole de Zoe.
vignettes
Contenu de la gallerie
photos
Contenu de la gallerie
Voir les commentaires...Photos en retard d'halloween
vignettes
Contenu de la gallerie
photos
Contenu de la gallerie
Voir les commentaires...Photos en retard d'Aout 2011
vignettes
Contenu de la gallerie
photos
Contenu de la gallerie
Voir les commentaires...Dimanche, 11. Mars 2012
Photos en retard de Aout 2011
vignettes
Contenu de la gallerie
grand format
Contenu de la gallerie
Voir les commentaires...Vendredi, 9. Mars 2012
Photos en retard de Juin 2011
vignettes
Contenu de la gallerie
big photos
Contenu de la gallerie
Voir les commentaires...Mardi, 27. Décembre 2011
Quelques photos de notre voyage a San Diego. Il n'y a malheureusement pas tout : il nous manque nottamment Balboa Park puisque nous sommes parti sans l'appareil photo.
Vignettes San Diego
Contenu de la gallerie
Grosses photos San Diego
Contenu de la gallerie
Voir les commentaires...Jeudi, 8. Septembre 2011
First attempt to full implementation of the c++ atomic library.
platform :
- linux intel 32 bits
- linux intel 64 bits
Right now the implementation is using gcc's intrinsic so it's over synchronizing, I'll do an assembly implementation with less compiler barrier later.
Enjoy !
LGPL template based implementation. Supports for both c++98 and c++0x atomic-2011-09-07.tgz.
see file INSTALL for instructions to build
Older version :
- macro based impl : atomic.tgz.
- template based impl atomic-2010-11-21.tgz. This version will be easier to extend to other platform if needed.
- more factorisation in template based impl atomic-2010-11-23.tgz. Platform specific code reduced to minimal.
- fixes + license + doc + starts of support for c++98 atomic-2010-11-26.tgz.
- support for both c++98 and c++0x atomic-2010-12-10.tgz
- fixes for assert.h included at the wrong place + operator= and cast operator not using seq_cst model + load relaxed not preventing the compiler to cache/optimize out the value atomic-2011-01-31.tgz
- fixes to remove strict aliasing warnings in release mode atomic-2011-09-07.tgz
Vendredi, 2. Septembre 2011
Integers
- http://www.jjj.de/bitwizardry/bitwizardrypage.html
- http://graphics.stanford.edu/~seander/bithacks.html
Floating points
- http://chrishecker.com/images/f/fb/Gdmfp.pdf
- http://stereopsis.com/FPU.html
- http://stereopsis.com/sree/fpu2006.html
Dimanche, 13. Mars 2011
Hello,
une petite video de ma p'tite puce qui fait du patin.
Voir les commentaires...Jeudi, 10. Mars 2011
place holder for some links on rdma :
- openfabrics wiki
- openfabrics linux with source code
- roland's rdma blog
- rdma consortium with some links/faq/tutorials
- infiniband software on linux
- Introduction to infiniband
- Cisco OpenFabrics
- Infiniband howto
- Troubleshooting infiniband connections
- Introduction to IB
- Infiniband architecture specification
- Building an rdma capable app. Intro
- Building an rdma capable app. The server
- Building an rdma capable app. The client
- Building an rdma capable app. rdma read/write
- Infiniband and 10G Ethernet for dummies
- Mellanox RDMA programming user manual
- IBM OFED doc
- Voltaire OFED doc
- Infiniband network architeture
From the sources it is possible to download some client/server code using rdma with infiniband. Another technology is rdma with iwarp (rdma over ethernet if I have understood correctly).
Voir les commentaires...Samedi, 5. Mars 2011
Ca y est : il marche le petiot ! Wow youhouhou ! Depuis le temps qu'on attendait ca
Je ne savais pas que ca me ferait autant plaisir de le voir marcher !
Bon tonton Damien : tu es battu : ton neveu aura marche a 15 mois et demi ! Qu'est ce qu'on gagne ?
Voir les commentaires...Dimanche, 6. Février 2011
En retard, on est tres en retard !
Voici les dernieres photos qui courent sur les 3 derniers mois. Enjoy !
Vignettes
Contenu de la gallerie
Photos
Contenu de la gallerie
Voir les commentaires...Lundi, 31. Janvier 2011
Intro
Former papers helped me to formalize all the different articles I had to read in order to make sense of the new memory model for c++.
It's now time to emit some rules in order to implement it (on x86).
Operations
The standard library defines a number of operations that take in parameter the memory model to use
- load
- store
- exchange
- compare_exchange_weak
- compare_exchange_strong
- fetch_add
- fetch_sub
- fetch_and
- fetch_or
- fetch_xor
And the memory models allowed are
- memory_order_relaxed
- memory_order_consume
- memory_order_acquire
- memory_order_release
- memory_order_acq_rel
- memory_order_seq_cst
By default, all atomic operations use the "safe" memory_order_seq_cst model. Not all memory model are applicable to all operations. For example, it is illegal to call a store operation while specifying an acquire or consume memory model.
x86 rules
Found from an article of Bartosz Milewski :
- Intel specs. See chapter 8.2 "Memory ordering" p.
328.
- AMD specs. See chapter 7 "Memory system" p.207.
Guarantees :
- No store/store reordering
- No load/load reordering
- No store/load reordering
- lock instruction acts as full memory barrier (no store/load reordered with respects to locked operations)
- total order of stores
- total order of lock instruction
- memory visibility obeys causality (respects transitive visibility)
Not guaranteed :
- load/store reordering (a load can be reordered with respects to older stores)
compiler barrier
Just insert an empty inline asm statement which has the "volatile" attribute and specifies "memory" in the clobbered list. This indicates that the instruction might touch all memory and that gcc cannot reorder with respect to that instruction and also that gcc needs to reload registers if it has cached any value before the statement.
__asm__ __volatile__ ("" ::: "memory");
memory clobber
For now implemented as a compiler barrier. I'm still searching wether we can force the memory clobber without the ordering guarantee.
The memory clobber is needed in load relaxed and load consume memory model so that we are guaranteed the compiler will not cache the early load in a register. This could lead to infinite loop.
This behaviour is mandated by 1.10/25 in the atomic library spec that says :
An implementation should ensure that the last value (in modification order) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time.
It means that even with relaxed memory order, the compiler must ensure that any store are visible in a finite number of steps. Therefore, a load relaxed must be reloaded (if not all the time, at least from time to time in a finite number of steps).
In that respect, a relaxed load act as a volatile read.
cpu barrier
We can use gcc's builtin
__sync_synchronize()
load
Load from memory.
If memory model is
- relaxed : memory clobber
- consume : memory clobber
- acquire : compiler barrier after the load
- release : illegal
- acq_rel : illegal
- seq_cst : compiler barrier after the load
store
Store to memory
If memory model is
- relaxed : nothing special to do
- consume : illegal
- acquire : illegal
- release : compiler barrier before the store
- acq_rel : illegal
- seq_cst : compiler barrier before the store, cpu barrier after the store, or use an atomic xchg operation which combines both and is one instruction shorter. (and can also be made a compiler barrier at the same time)
exchange
This operation defines the equivalent C code to be executed atomically :
exchange(ptr, new_value) : tmp = *ptr // load *ptr = new_value // store return tmp;
we can use the gcc builtin
__sync_lock_test_and_set()
which also acts as an acquire barrier.
It means we have to do extra stuff for this one to work :
- relaxed : nothing special to do (we already are doing too much sync)
- consume : nothing special to do (we already are doing too much sync)
- acquire : nothing special to do (the barrier after is already included)
- release : compiler barrier before (we are doing too much sync because of the barrier after being included)
- acq_rel : compiler barrier before (the barrier after is already included)
- seq_cst : compiler barrier before (the barrier after is already included)
If we want to allow the compiler to optimize more stuff, we need to define ourself the assembly so it does not include the compiler barrier.
In that case, the compiler is not seeing anymore that this operation might change the memory, so we need to put the compiler barriers ourself :
- relaxed : memory clobber
- consume : memory clobber
- acquire : compiler barrier after the xchg
- release : compiler barrier before the xchg
- acq_rel : compiler barrier before and after
- seq_cst : compiler barrier before and after
compare_exchange_weak
There is no compare exchange "weak" operation on x86. It is equivalent to a compare exchange "strong" operation.
In the standard, the difference is that it allows it to fails spuriously. I guess this is to allow implementation using LL/SC instead of a CAS.
compare_exchange_strong
This operation defines the equivalent C code to be executed atomically :
compare_exchange_strong(ptr, expected, desired) :
tmp = *ptr
if (tmp == expected)
{
*ptr == desired
}
return tmp;
In the C++ API, this operation takes 2 different memory model parameters. One to use in case of success, one to use in case of failure.
we can use the gcc builtin
__sync_val_compare_and_swap()
which already includes a compiler barrier.
Therefore we don't have anything else to add.
If we want to allow the compiler to optimize more stuff, we need to define ourself the assembly so it does not include the compiler barrier.
In that case, the compiler is not seeing anymore that this operation might change the memory, so we need to put the compiler barriers ourself :
- relaxed : memory clobber
- consume : memory clobber
- acquire : compiler barrier after the cmpxchg
- release : compiler barrier before the cmpxchg
- acq_rel : compiler barrier before and after
- seq_cst : compiler barrier before and after
fetch_add
This operation defines the equivalent C code to be executed atomically :
fetch_add(ptr, increment) : tmp = *ptr tmp += increment *ptr = tmp
we can use the gcc builtin
__sync_fetch_and_add()
which already includes a compiler barrier.
Therefore we don't have anything else to add.
If we want to allow the compiler to optimize more stuff, we need to define ourself the assembly so it does not include the compiler barrier.
In that case, the compiler is not seeing anymore that this operation might change the memory, so we need to put the compiler barriers ourself :
- relaxed : memory clobber
- consume : memory clobber
- acquire : compiler barrier after the xadd
- release : compiler barrier before the xadd
- acq_rel : compiler barrier before and after
- seq_cst : compiler barrier before and after
fetch_sub
This operation defines the equivalent C code to be executed atomically :
fetch_sub(ptr, decrement) : tmp = *ptr tmp -= decrement *ptr = tmp
we can use the gcc builtin
__sync_fetch_and_sub()
which already includes a compiler barrier.
Therefore we don't have anything else to add.
If we want to allow the compiler to optimize more stuff, we need to define ourself the assembly so it does not include the compiler barrier.
In that case, the compiler is not seeing anymore that this operation might change the memory, so we need to put the compiler barriers ourself :
- relaxed : memory clobber
- consume : memory clobber
- acquire : compiler barrier after the xadd
- release : compiler barrier before the xadd
- acq_rel : compiler barrier before and after
- seq_cst : compiler barrier before and after
fetch_and
This operation defines the equivalent C code to be executed atomically :
fetch_and(ptr, value) : tmp = *ptr tmp &= value *ptr = tmp
we can use the gcc builtin
__sync_fetch_and_and()
which already includes a compiler barrier.
Therefore we don't have anything else to add.
If we want to allow the compiler to optimize more stuff, we need to define ourself the assembly so it does not include the compiler barrier.
In that case, the compiler is not seeing anymore that this operation might change the memory, so we need to put the compiler barriers ourself :
- relaxed : memory clobber
- consume : memory clobber
- acquire : compiler barrier after the xadd
- release : compiler barrier before the xadd
- acq_rel : compiler barrier before and after
- seq_cst : compiler barrier before and after
fetch_or
This operation defines the equivalent C code to be executed atomically :
fetch_or(ptr, value) : tmp = *ptr tmp |= value *ptr = tmp
we can use the gcc builtin
__sync_fetch_and_or()
which already includes a compiler barrier.
Therefore we don't have anything else to add.
If we want to allow the compiler to optimize more stuff, we need to define ourself the assembly so it does not include the compiler barrier.
In that case, the compiler is not seeing anymore that this operation might change the memory, so we need to put the compiler barriers ourself :
- relaxed : memory clobber
- consume : memory clobber
- acquire : compiler barrier after the xadd
- release : compiler barrier before the xadd
- acq_rel : compiler barrier before and after
- seq_cst : compiler barrier before and after
fetch_xor
This operation defines the equivalent C code to be executed atomically :
fetch_xor(ptr, value) : tmp = *ptr tmp ^= value *ptr = tmp
we can use the gcc builtin
__sync_fetch_and_xor()
which already includes a compiler barrier.
Therefore we don't have anything else to add.
If we want to allow the compiler to optimize more stuff, we need to define ourself the assembly so it does not include the compiler barrier.
In that case, the compiler is not seeing anymore that this operation might change the memory, so we need to put the compiler barriers ourself :
- relaxed : memory clobber
- consume : memory clobber
- acquire : compiler barrier after the xadd
- release : compiler barrier before the xadd
- acq_rel : compiler barrier before and after
- seq_cst : compiler barrier before and after
Intro
So far, we've only seen the sequential consistency model.
The new C++ standard defines 3 other models :
- acquire/release
- consume/release
- relaxed
acquire/release
This model comes from the semantic associated with locks :
- when a lock is acquired, the code in the critical section must not be visible before the lock is acquired
- when a lock is released, the code in the critical section must not be visible after the lock is released
From the rules of the previous article :
- we need a barrier right after the lock is acquired
- we need a barrier right before the lock is released
Since this atomic library is all about writing code with atomic operations and without locks, this memory model applies with loads and stores as well :
- code that follows a load with acquire semantic must not be visible before the "equivalent virtual" lock is acquired
- code precending a store with release semantic must not be visible after the "equivalent virtual" lock is released
In this example :
Thread 1 Thread 2 x = 42 while (!x_init) ; x_init = true r1 = x;
x_init synchronizes in the 2 threads, instead of using a sequential consistency model we can use acquire/release :
Thread 1 Thread 2 x = 42 while (!x_init.load(memory_order_acquire)) ; x_init.store(true, memory_order_release) r1 = x;
In that model, the "equivalent virtual" lock used when acquiring/releasing
- protects non atomic variables that are sequenced before the store release.
- protects non atomic variables that are sequenced after the load acquire.
- synchronizes only code using the same atomic variable accross multiple threads
In the previous example, x is not atomic, but the 2 threads are synchronized using the same atomic variable (x_init).
In thread 1, the store to x is sequenced before (in program order, the store to x is written before the store to x_init), therefore it happens before the store to x_init.
In thread 2, the load from x synchronizes with the store to x from thread 1. So when Thread 2 sees the store to x in thread 1, it also sees all operations that happened before it in thread 1.
Finally, since in thread 2, the while loop is sequenced before the load from x, it happens before the load.
Using the sequenced before property (intra thread property), the synchronization property (inter thread property) to establish the "happens before" property (inter thread property), and this property being transitive, we can say that the store to x in thread 1 happens before the load from x in thread 2.
Another non intuitive example :
Thread 1- y.store (20, memory_order_release);
Thread 2- x.store (10, memory_order_release);
Thread 3- assert (y.load (memory_order_acquire) == 20 && x.load(memory_order_acquire) == 0)
Thread 4- assert (y.load (memory_order_acquire) == 0 && x.load(memory_order_acquire) == 10)
In that example
- the store to y in thread 1 synchronizes with the load from it in thread 3 and in thread 4.
- the store to x in thread 2 synchronizes with the load from it in thread 3 and in thread 4.
However since thread 1 and thread 2 stores to unrelated variables they do not synchronize with each other.
The net effect is that at the same time
- thread 3 is allowed to see the store to y but not the store to x
- thread 4 is allowed to see the store to x, but not the store to y
There is no busy waiting in thread 3 and thread 4, so it means that thread 3 and thread 4 are not guaranteed to see the store at all. What is guaranteed is that if they see the store then they also see what happened before.
So both asserts are allowed to succeed. (They are not guaranteed to succeed, they are just allowed to succeed).
With sequential consistency, a total order is established, so one of the store must happen before and is visible by all cores at the same time. Therefore it is guaranteed that at least one assert will fail.
How is this behaviour even possible ?
Well, I don't think this is possible on x86, but on some architectures, we can imagine that a core can synchronize with another core but not with all cores on the system, allowing that behaviour.
Imagine a hierarchical NUMA system : cores which are near to each other (sharing the same memory bank) are more likely to see the changes triggered by their neighbor before any remote cores. With acquire/release, we explicitly allow that behaviour, while with sequential consistency, we force all the cores to synchronize. (which is obviously more costly)
Note : on x86, the barrier before the release store needs to be a compiler barrier. The barrier after the acquire load needs to be a compiler barrier as well.
Once again on x86, only a load can be reordered with respect to older stores. (load/store reordering)
We don't want code from within the critical section to be visible before the lock is acquired or after the lock is released.
When we store release, we don't want the code before the store to be visible after it, this is not possible since on x86, the stores won't be reordered. We need only a compiler barrier.
When we load acquire, we don't want the code after the load to be visible before it. Since there is no CPU load/load nor store/load reordering we only need a compiler barrier.
consume/release
Consume/release is similar to acquire/release but in a more relaxed way. We only want to establish a synchronization between a store and a load. There is no lock semantic associated to it. So we're not trying to prevent any code from escaping a critical section.
So the only difference with a relaxed store/load is we are indicating the CPU that any dependant variable in the load expression must not be reordered before the load.
Example :
Thread 1 n = 1 x.store(&n, memory_order_release)
Thread 2 y = x.load(memory_order_consume) z = *y
Yes, some CPUs (alpha) are very aggressive in what they reorder, and they can even reorder that. I think it is safe to say a compiler would never reorder that on its own.
Note : on x86, the semantic of consume is guaranteed without doing anything, so I think the consume is equivalent to a relax memory model.
relaxed
It just means that we don't put any ordering constraint on the operations. Compilers and CPUs are free to reorder as much as they want.
I guess this is needed only in rare cases where we have to use a synchronization variable at one place of the algorithm and in a second phase of the algorithm we don't need any special guarantee on the same synchronization variable. (because perhaps another synchronization varible is used to enforce the order)
Update : not quite. Since the atomic lib spec states that
"An implementation should ensure that the last value (in modification order) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time. "
It means that a load of a memory_order_relaxed must still maintain the guarantee that the compiler will reload (if not all the time at least in a finite number of steps) values from memory. In that respect, an atomic load with relaxed memory order behaves a little bit like a volatile atomic access in that it prevents the compiler from caching forever and optimizing out some conditions/loops if it can prove no change can happen in the current thread.
Other links
Just found that post on Anthony's Williams blog : The Intel x86 Memory Ordering Guarantees and the C++ Memory Model
This confirms what's been writen above. Too bad I find it just now

