David's Wikiblog

 

Mardi, 17. Avril 2012

 

Dr.Dobb's articles by Herb Sutter :

Dr.Dobb's articles by Andrei Alexandrescu :

Articles by Bartosz Milewski :

Theoretical papers :

others :

libraries :

Voir les commentaires...

Mercredi, 21. Mars 2012

 

Zoe Lise Ewann part 2

Vignettes

Contenu de la gallerie

Grandes photos

Contenu de la gallerie

Voir les commentaires... 

Zoe-Lise-Ewann part 1/2

Vignettes

Contenu de la gallerie

Grandes photos

Contenu de la gallerie

Voir les commentaires... 

Photos en retard noel

vignettes

Contenu de la gallerie

grand format

Contenu de la gallerie

Voir les commentaires... 

Photos en retard noel

vignettes

Contenu de la gallerie

grand format

Contenu de la gallerie

Voir les commentaires... 

Photos en retard de l'anniv des petits

vignettes

Contenu de la gallerie

photos

Contenu de la gallerie

Voir les commentaires... 

Photos en retard d'halloween a l'ecole de Zoe.

vignettes

Contenu de la gallerie

photos

Contenu de la gallerie

Voir les commentaires... 

Photos en retard d'halloween

vignettes

Contenu de la gallerie

photos

Contenu de la gallerie

Voir les commentaires... 

Photos en retard d'Aout 2011

vignettes

Contenu de la gallerie

photos

Contenu de la gallerie

Voir les commentaires...

Dimanche, 11. Mars 2012

 

Photos en retard de Aout 2011

vignettes

Contenu de la gallerie

grand format

Contenu de la gallerie

Voir les commentaires...

Vendredi, 9. Mars 2012

 

Photos en retard de Juin 2011

vignettes

Contenu de la gallerie

big photos

Contenu de la gallerie

Voir les commentaires...

Mardi, 27. Décembre 2011

 

Quelques photos de notre voyage a San Diego. Il n'y a malheureusement pas tout : il nous manque nottamment Balboa Park puisque nous sommes parti sans l'appareil photo.

Vignettes San Diego

Contenu de la gallerie

Grosses photos San Diego

Contenu de la gallerie

Voir les commentaires...

Jeudi, 8. Septembre 2011

 

First attempt to full implementation of the c++ atomic library.

platform :

  • linux intel 32 bits
  • linux intel 64 bits

Right now the implementation is using gcc's intrinsic so it's over synchronizing, I'll do an assembly implementation with less compiler barrier later.

Enjoy !

LGPL template based implementation. Supports for both c++98 and c++0x atomic-2011-09-07.tgz.

see file INSTALL for instructions to build

Older version :

Voir les commentaires...

Vendredi, 2. Septembre 2011

 

Integers

Floating points

Voir les commentaires...

Dimanche, 13. Mars 2011

 

Hello,

une petite video de ma p'tite puce qui fait du patin.

Voir les commentaires...

Jeudi, 10. Mars 2011

 

place holder for some links on rdma :

From the sources it is possible to download some client/server code using rdma with infiniband. Another technology is rdma with iwarp (rdma over ethernet if I have understood correctly).

Voir les commentaires...

Samedi, 5. Mars 2011

 

Ca y est : il marche le petiot ! Wow youhouhou ! Depuis le temps qu'on attendait ca ;-) Je ne savais pas que ca me ferait autant plaisir de le voir marcher !

Bon tonton Damien : tu es battu : ton neveu aura marche a 15 mois et demi ! Qu'est ce qu'on gagne ?

Voir les commentaires...

Dimanche, 6. Février 2011

 

En retard, on est tres en retard !

Voici les dernieres photos qui courent sur les 3 derniers mois. Enjoy !

Vignettes

Contenu de la gallerie

Photos

Contenu de la gallerie

Voir les commentaires...

Lundi, 31. Janvier 2011

 

Intro

Former papers helped me to formalize all the different articles I had to read in order to make sense of the new memory model for c++.

It's now time to emit some rules in order to implement it (on x86).

Operations

The standard library defines a number of operations that take in parameter the memory model to use

  • load
  • store
  • exchange
  • compare_exchange_weak
  • compare_exchange_strong
  • fetch_add
  • fetch_sub
  • fetch_and
  • fetch_or
  • fetch_xor

And the memory models allowed are

  • memory_order_relaxed
  • memory_order_consume
  • memory_order_acquire
  • memory_order_release
  • memory_order_acq_rel
  • memory_order_seq_cst

By default, all atomic operations use the "safe" memory_order_seq_cst model. Not all memory model are applicable to all operations. For example, it is illegal to call a store operation while specifying an acquire or consume memory model.

x86 rules

Found from an article of Bartosz Milewski :

328.

  • AMD specs. See chapter 7 "Memory system" p.207.

Guarantees :

  • No store/store reordering
  • No load/load reordering
  • No store/load reordering
  • lock instruction acts as full memory barrier (no store/load reordered with respects to locked operations)
  • total order of stores
  • total order of lock instruction
  • memory visibility obeys causality (respects transitive visibility)

Not guaranteed :

  • load/store reordering (a load can be reordered with respects to older stores)

compiler barrier

Just insert an empty inline asm statement which has the "volatile" attribute and specifies "memory" in the clobbered list. This indicates that the instruction might touch all memory and that gcc cannot reorder with respect to that instruction and also that gcc needs to reload registers if it has cached any value before the statement.

 __asm__ __volatile__ ("" ::: "memory");

memory clobber

For now implemented as a compiler barrier. I'm still searching wether we can force the memory clobber without the ordering guarantee.

The memory clobber is needed in load relaxed and load consume memory model so that we are guaranteed the compiler will not cache the early load in a register. This could lead to infinite loop.

This behaviour is mandated by 1.10/25 in the atomic library spec that says :

An implementation should ensure that the last value (in modification order) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time.

It means that even with relaxed memory order, the compiler must ensure that any store are visible in a finite number of steps. Therefore, a load relaxed must be reloaded (if not all the time, at least from time to time in a finite number of steps).

In that respect, a relaxed load act as a volatile read.

cpu barrier

We can use gcc's builtin

 __sync_synchronize()

load

Load from memory.

If memory model is

  • relaxed : memory clobber
  • consume : memory clobber
  • acquire : compiler barrier after the load
  • release : illegal
  • acq_rel : illegal
  • seq_cst : compiler barrier after the load

store

Store to memory

If memory model is

  • relaxed : nothing special to do
  • consume : illegal
  • acquire : illegal
  • release : compiler barrier before the store
  • acq_rel : illegal
  • seq_cst : compiler barrier before the store, cpu barrier after the store, or use an atomic xchg operation which combines both and is one instruction shorter. (and can also be made a compiler barrier at the same time)

exchange

This operation defines the equivalent C code to be executed atomically :

 exchange(ptr, new_value) :
   tmp  = *ptr      // load
   *ptr = new_value // store
   return tmp;

we can use the gcc builtin

 __sync_lock_test_and_set()

which also acts as an acquire barrier.
It means we have to do extra stuff for this one to work :

  • relaxed : nothing special to do (we already are doing too much sync)
  • consume : nothing special to do (we already are doing too much sync)
  • acquire : nothing special to do (the barrier after is already included)
  • release : compiler barrier before (we are doing too much sync because of the barrier after being included)
  • acq_rel : compiler barrier before (the barrier after is already included)
  • seq_cst : compiler barrier before (the barrier after is already included)

If we want to allow the compiler to optimize more stuff, we need to define ourself the assembly so it does not include the compiler barrier.

In that case, the compiler is not seeing anymore that this operation might change the memory, so we need to put the compiler barriers ourself :

  • relaxed : memory clobber
  • consume : memory clobber
  • acquire : compiler barrier after the xchg
  • release : compiler barrier before the xchg
  • acq_rel : compiler barrier before and after
  • seq_cst : compiler barrier before and after

compare_exchange_weak

There is no compare exchange "weak" operation on x86. It is equivalent to a compare exchange "strong" operation.

In the standard, the difference is that it allows it to fails spuriously. I guess this is to allow implementation using LL/SC instead of a CAS.

compare_exchange_strong

This operation defines the equivalent C code to be executed atomically :

 compare_exchange_strong(ptr, expected, desired) :
   tmp  = *ptr
   if (tmp == expected)
   {
     *ptr == desired
   }
   return tmp;

In the C++ API, this operation takes 2 different memory model parameters. One to use in case of success, one to use in case of failure.

we can use the gcc builtin

 __sync_val_compare_and_swap()

which already includes a compiler barrier.
Therefore we don't have anything else to add.

If we want to allow the compiler to optimize more stuff, we need to define ourself the assembly so it does not include the compiler barrier.

In that case, the compiler is not seeing anymore that this operation might change the memory, so we need to put the compiler barriers ourself :

  • relaxed : memory clobber
  • consume : memory clobber
  • acquire : compiler barrier after the cmpxchg
  • release : compiler barrier before the cmpxchg
  • acq_rel : compiler barrier before and after
  • seq_cst : compiler barrier before and after

fetch_add

This operation defines the equivalent C code to be executed atomically :

 fetch_add(ptr, increment) :
   tmp = *ptr
   tmp += increment
   *ptr = tmp

we can use the gcc builtin

 __sync_fetch_and_add()

which already includes a compiler barrier.
Therefore we don't have anything else to add.

If we want to allow the compiler to optimize more stuff, we need to define ourself the assembly so it does not include the compiler barrier.

In that case, the compiler is not seeing anymore that this operation might change the memory, so we need to put the compiler barriers ourself :

  • relaxed : memory clobber
  • consume : memory clobber
  • acquire : compiler barrier after the xadd
  • release : compiler barrier before the xadd
  • acq_rel : compiler barrier before and after
  • seq_cst : compiler barrier before and after

fetch_sub

This operation defines the equivalent C code to be executed atomically :

 fetch_sub(ptr, decrement) :
   tmp = *ptr
   tmp -= decrement
   *ptr = tmp

we can use the gcc builtin

 __sync_fetch_and_sub()

which already includes a compiler barrier.
Therefore we don't have anything else to add.

If we want to allow the compiler to optimize more stuff, we need to define ourself the assembly so it does not include the compiler barrier.

In that case, the compiler is not seeing anymore that this operation might change the memory, so we need to put the compiler barriers ourself :

  • relaxed : memory clobber
  • consume : memory clobber
  • acquire : compiler barrier after the xadd
  • release : compiler barrier before the xadd
  • acq_rel : compiler barrier before and after
  • seq_cst : compiler barrier before and after

fetch_and

This operation defines the equivalent C code to be executed atomically :

 fetch_and(ptr, value) :
   tmp = *ptr
   tmp &= value
   *ptr = tmp

we can use the gcc builtin

 __sync_fetch_and_and()

which already includes a compiler barrier.
Therefore we don't have anything else to add.

If we want to allow the compiler to optimize more stuff, we need to define ourself the assembly so it does not include the compiler barrier.

In that case, the compiler is not seeing anymore that this operation might change the memory, so we need to put the compiler barriers ourself :

  • relaxed : memory clobber
  • consume : memory clobber
  • acquire : compiler barrier after the xadd
  • release : compiler barrier before the xadd
  • acq_rel : compiler barrier before and after
  • seq_cst : compiler barrier before and after

fetch_or

This operation defines the equivalent C code to be executed atomically :

 fetch_or(ptr, value) :
   tmp = *ptr
   tmp |= value
   *ptr = tmp

we can use the gcc builtin

 __sync_fetch_and_or()

which already includes a compiler barrier.
Therefore we don't have anything else to add.

If we want to allow the compiler to optimize more stuff, we need to define ourself the assembly so it does not include the compiler barrier.

In that case, the compiler is not seeing anymore that this operation might change the memory, so we need to put the compiler barriers ourself :

  • relaxed : memory clobber
  • consume : memory clobber
  • acquire : compiler barrier after the xadd
  • release : compiler barrier before the xadd
  • acq_rel : compiler barrier before and after
  • seq_cst : compiler barrier before and after

fetch_xor

This operation defines the equivalent C code to be executed atomically :

 fetch_xor(ptr, value) :
   tmp = *ptr
   tmp ^= value
   *ptr = tmp

we can use the gcc builtin

 __sync_fetch_and_xor()

which already includes a compiler barrier.
Therefore we don't have anything else to add.

If we want to allow the compiler to optimize more stuff, we need to define ourself the assembly so it does not include the compiler barrier.

In that case, the compiler is not seeing anymore that this operation might change the memory, so we need to put the compiler barriers ourself :

  • relaxed : memory clobber
  • consume : memory clobber
  • acquire : compiler barrier after the xadd
  • release : compiler barrier before the xadd
  • acq_rel : compiler barrier before and after
  • seq_cst : compiler barrier before and after
Voir les commentaires... 

Intro

So far, we've only seen the sequential consistency model.
The new C++ standard defines 3 other models :

  • acquire/release
  • consume/release
  • relaxed

acquire/release

This model comes from the semantic associated with locks :

  • when a lock is acquired, the code in the critical section must not be visible before the lock is acquired
  • when a lock is released, the code in the critical section must not be visible after the lock is released

From the rules of the previous article :

  • we need a barrier right after the lock is acquired
  • we need a barrier right before the lock is released

Since this atomic library is all about writing code with atomic operations and without locks, this memory model applies with loads and stores as well :

  • code that follows a load with acquire semantic must not be visible before the "equivalent virtual" lock is acquired
  • code precending a store with release semantic must not be visible after the "equivalent virtual" lock is released

In this example :

Thread 1         Thread 2
x = 42           while (!x_init) ;
x_init = true    r1 = x;

x_init synchronizes in the 2 threads, instead of using a sequential consistency model we can use acquire/release :

Thread 1                                  Thread 2
x = 42                                    while (!x_init.load(memory_order_acquire)) ;
x_init.store(true, memory_order_release)  r1 = x;

In that model, the "equivalent virtual" lock used when acquiring/releasing

  • protects non atomic variables that are sequenced before the store release.
  • protects non atomic variables that are sequenced after the load acquire.
  • synchronizes only code using the same atomic variable accross multiple threads

In the previous example, x is not atomic, but the 2 threads are synchronized using the same atomic variable (x_init).

In thread 1, the store to x is sequenced before (in program order, the store to x is written before the store to x_init), therefore it happens before the store to x_init.
In thread 2, the load from x synchronizes with the store to x from thread 1. So when Thread 2 sees the store to x in thread 1, it also sees all operations that happened before it in thread 1.
Finally, since in thread 2, the while loop is sequenced before the load from x, it happens before the load.
Using the sequenced before property (intra thread property), the synchronization property (inter thread property) to establish the "happens before" property (inter thread property), and this property being transitive, we can say that the store to x in thread 1 happens before the load from x in thread 2.

Another non intuitive example :

Thread 1-
y.store (20, memory_order_release);
Thread 2-
x.store (10, memory_order_release);
Thread 3-
assert (y.load (memory_order_acquire) == 20 && x.load(memory_order_acquire) == 0)
Thread 4-
assert (y.load (memory_order_acquire) == 0 && x.load(memory_order_acquire) == 10)

In that example

  • the store to y in thread 1 synchronizes with the load from it in thread 3 and in thread 4.
  • the store to x in thread 2 synchronizes with the load from it in thread 3 and in thread 4.

However since thread 1 and thread 2 stores to unrelated variables they do not synchronize with each other.
The net effect is that at the same time

  • thread 3 is allowed to see the store to y but not the store to x
  • thread 4 is allowed to see the store to x, but not the store to y

There is no busy waiting in thread 3 and thread 4, so it means that thread 3 and thread 4 are not guaranteed to see the store at all. What is guaranteed is that if they see the store then they also see what happened before.

So both asserts are allowed to succeed. (They are not guaranteed to succeed, they are just allowed to succeed).

With sequential consistency, a total order is established, so one of the store must happen before and is visible by all cores at the same time. Therefore it is guaranteed that at least one assert will fail.

How is this behaviour even possible ?
Well, I don't think this is possible on x86, but on some architectures, we can imagine that a core can synchronize with another core but not with all cores on the system, allowing that behaviour.
Imagine a hierarchical NUMA system : cores which are near to each other (sharing the same memory bank) are more likely to see the changes triggered by their neighbor before any remote cores. With acquire/release, we explicitly allow that behaviour, while with sequential consistency, we force all the cores to synchronize. (which is obviously more costly)

Note : on x86, the barrier before the release store needs to be a compiler barrier. The barrier after the acquire load needs to be a compiler barrier as well.

Once again on x86, only a load can be reordered with respect to older stores. (load/store reordering)

We don't want code from within the critical section to be visible before the lock is acquired or after the lock is released.

When we store release, we don't want the code before the store to be visible after it, this is not possible since on x86, the stores won't be reordered. We need only a compiler barrier.

When we load acquire, we don't want the code after the load to be visible before it. Since there is no CPU load/load nor store/load reordering we only need a compiler barrier.

consume/release

Consume/release is similar to acquire/release but in a more relaxed way. We only want to establish a synchronization between a store and a load. There is no lock semantic associated to it. So we're not trying to prevent any code from escaping a critical section.

So the only difference with a relaxed store/load is we are indicating the CPU that any dependant variable in the load expression must not be reordered before the load.

Example :

 Thread 1
 n = 1
 x.store(&n, memory_order_release)
 Thread 2
 y = x.load(memory_order_consume)
 z = *y

Yes, some CPUs (alpha) are very aggressive in what they reorder, and they can even reorder that. I think it is safe to say a compiler would never reorder that on its own.

Note : on x86, the semantic of consume is guaranteed without doing anything, so I think the consume is equivalent to a relax memory model.

relaxed

It just means that we don't put any ordering constraint on the operations. Compilers and CPUs are free to reorder as much as they want.

I guess this is needed only in rare cases where we have to use a synchronization variable at one place of the algorithm and in a second phase of the algorithm we don't need any special guarantee on the same synchronization variable. (because perhaps another synchronization varible is used to enforce the order)

Update : not quite. Since the atomic lib spec states that

"An implementation should ensure that the last value (in modification order) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time. "

It means that a load of a memory_order_relaxed must still maintain the guarantee that the compiler will reload (if not all the time at least in a finite number of steps) values from memory. In that respect, an atomic load with relaxed memory order behaves a little bit like a volatile atomic access in that it prevents the compiler from caching forever and optimizing out some conditions/loops if it can prove no change can happen in the current thread.

Other links

Just found that post on Anthony's Williams blog : The Intel x86 Memory Ordering Guarantees and the C++ Memory Model

This confirms what's been writen above. Too bad I find it just now ;-)

Voir les commentaires...
Toutes les photos et textes sous license Creative Commons by-nc-sa par Nancy Gauthier et David Jobet