A debugger barrier

luu | 48 points

Is it "standard practice" nowadays to use C++11 memory orderings or barriers instead of platform-specific macros? I know that's the case in the Rust world.

nyanpasu64 | a year ago

To check my understanding:

The bug looks a bit like these two threads running in parallel:

  A: x := N
  A: y := N
  A: done := true
  
  B: if(done) {
  B:   assert(x == y)
  B: } else { retry }

Where the assert in thread B replaces some other code which had x == y as a precondition. The problem was thought to be that, due to weak memory ordering on the processor, B would sometimes observe done to be true but x != y.

The fix would be to insert appropriate memory barriers to get the threads to see consistent data.
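
As a minimal sketch of what such a fix could look like with C++11 orderings (not the code from the article; the names `Shared`, `writer`, `reader`, and `run_once` are made up for illustration), thread A publishes with a release store on done and thread B reads it with an acquire load:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names mirroring the pseudocode above.
struct Shared {
    int x = 0;                        // plain data, written only by thread A
    int y = 0;
    std::atomic<bool> done{false};    // the flag that publishes x and y
};

// Thread A: write the data, then publish it with a release store.
// The release store keeps the writes to x and y from moving after it.
void writer(Shared& s, int n) {
    s.x = n;
    s.y = n;
    s.done.store(true, std::memory_order_release);
}

// Thread B: retry until the acquire load observes done == true.
// Once it does, the writes to x and y made before the release store
// are guaranteed to be visible here.
bool reader(Shared& s) {
    while (!s.done.load(std::memory_order_acquire)) {
        std::this_thread::yield();    // the "retry" branch
    }
    return s.x == s.y;
}

// Run one writer and one reader; returns whether B saw x == y.
bool run_once(int n) {
    Shared s;
    bool ok = false;
    std::thread b([&] { ok = reader(s); });
    std::thread a([&] { writer(s, n); });
    a.join();
    b.join();
    return ok;
}
```

Under the C++ memory model the release/acquire pair on done is what makes it legal for B to read the non-atomic x and y at all; with a relaxed flag the reads would be a data race.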

The difficulty is determining whether the hypothesis was correct. The bug is rare, and the proposed fix might only make it rarer. It would be bad if the fix just made another bug harder to discover.

The idea, then, was to wait until you hit the edge case and check whether a memory fence is enough to make the assertion succeed. But instead of rewriting the code into something like the following, you can rely on invoking the debugger being invasive enough to have the effect of the fence, and then rerun the assertion.

  B: if(done) {
  B:   if(x != y) {
  B:     fence();
  B:     assert(x == y);
  B:     printf("Gotcha!\n");
  B:   }
  B:   // Now use x, y as before…
  B: } else { retry }

One issue: this does find evidence that missing memory barriers were sometimes the cause, and it guards against the case where the CPU is just having a good day while you're testing and the bug-triggering case never happens. But it doesn't prove that missing barriers are the only cause of the x != y issue (this example is presumably distilled from a single potential cause of something more complicated). For example, a rare buffer overflow might set done to true too early.

Is that a fair summary or did I miss something?

dan-robertson | a year ago

I kind of knew where this was going in terms of the actual bug, but the way they've "proven" their fix is indeed pretty neat. I would have thoroughly explained the bug and fix, but to actually verify it I would most likely have just done what they suggested (and rejected) first: run it long enough to show that the issue no longer occurs.

As an aside, I have not stared at the problem long enough to say this with full confidence, but if the barriers they inserted are full barriers, they might be stronger than necessary. Acquire/release might have been sufficient.
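
To make the distinction concrete, here is a rough sketch using standalone C++11 fences (hypothetical names `Slot`, `publish`, and `consume`; I have not checked it against the article's actual code). The flag itself is accessed with relaxed ordering, and the ordering comes from a release fence on the writer side and an acquire fence on the reader side. Swapping both for `std::memory_order_seq_cst` would give the "full barrier" variant, which is also correct but typically costs more on weakly ordered hardware.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical type for illustration: two plain fields plus a flag.
struct Slot {
    int x = 0;
    int y = 0;
    std::atomic<bool> done{false};
};

// Writer: the release fence orders the writes to x and y before the
// relaxed store to done.
void publish(Slot& s, int n) {
    s.x = n;
    s.y = n;
    std::atomic_thread_fence(std::memory_order_release);
    s.done.store(true, std::memory_order_relaxed);
}

// Reader: the relaxed load observes the flag, and the acquire fence
// after it orders the later reads of x and y behind that load.
bool consume(Slot& s) {
    while (!s.done.load(std::memory_order_relaxed)) {
        std::this_thread::yield();
    }
    std::atomic_thread_fence(std::memory_order_acquire);
    return s.x == s.y;
}
```

The release fence pairs with the acquire fence through the relaxed store and load of done, so the reader is guaranteed to see both writes once it has seen the flag.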

anyfoo | a year ago

Memory Reordering Caught in the Act covers this topic and is very well written: https://preshing.com/20120515/memory-reordering-caught-in-th...

amenghra | a year ago

Despite being popular in x86-centric communities, the "barrier" theory of multiprogramming cannot be correctly extended to weak architectures. Just learn release/acquire semantics which work everywhere and are fairly straightforward.

firstlink | a year ago

Does anyone have a good reference on how barriers of various “strengths” are actually implemented on several modern hardware platforms?

cmason | a year ago