Myths Programmers Believe about CPU Caches (2018)

whack | 153 points

Here's my favorite practically applicable cache-related fact: even on x86, on recent server CPUs, cache-coherency protocols may operate at a different granularity than the cache line size. A typical case on new Intel server CPUs is operating at the granularity of 2 consecutive cache lines. Some concurrency libraries and thread pools, like Crossbeam in Rust and my ForkUnion in Rust and C++, explicitly document this and align objects to 128 bytes [1]:

  /**
   *  @brief Defines variable alignment to avoid false sharing.
   *  @see https://en.cppreference.com/w/cpp/thread/hardware_destructive_interference_size
   *  @see https://docs.rs/crossbeam-utils/latest/crossbeam_utils/struct.CachePadded.html
   *
   *  The C++ STL way to do it is to use `std::hardware_destructive_interference_size` if available:
   *
   *  @code{.cpp}
   *  #if defined(__cpp_lib_hardware_interference_size)
   *  static constexpr std::size_t default_alignment_k = std::hardware_destructive_interference_size;
   *  #else
   *  static constexpr std::size_t default_alignment_k = alignof(std::max_align_t);
   *  #endif
   *  @endcode
   *
   *  That, however, results in all kinds of ABI warnings with GCC and a suboptimal alignment choice,
   *  unless you hard-code `--param destructive-interference-size=64` or disable the warning
   *  with `-Wno-interference-size`.
   */
  static constexpr std::size_t default_alignment_k = 128;
As mentioned in the docstring above, using STL's `std::hardware_destructive_interference_size` won't help you. On ARM, this issue becomes even more pronounced, so concurrency-heavy code should ideally be compiled multiple times for different coherence protocols and leverage "dynamic dispatch", similar to how I & others handle SIMD instructions in libraries that need to run on a very diverse set of platforms.

[1] https://github.com/ashvardanian/ForkUnion/blob/46666f6347ece...
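
To make the constant above concrete, here is a minimal sketch of how such an alignment value is typically applied (illustrative only, not the actual ForkUnion code; `padded_counter_t` is a made-up name): each hot, independently updated counter gets its own 128-byte slot, so two threads never contend on the same coherence unit.

  #include <atomic>
  #include <cstddef>
  #include <vector>

  static constexpr std::size_t default_alignment_k = 128;

  // One counter per worker thread. `alignas` both aligns the object and
  // rounds `sizeof` up to 128 bytes, so neighboring array elements can
  // never share a cache line (or a 2-line coherence unit on newer Intel
  // server CPUs).
  struct alignas(default_alignment_k) padded_counter_t {
      std::atomic<std::size_t> value{0};
  };

  static_assert(sizeof(padded_counter_t) == default_alignment_k);

  // Each thread increments only its own element, so the line stays in that
  // core's cache. Requires C++17 for the over-aligned heap allocation.
  std::vector<padded_counter_t> counters(16);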

ashvardanian | 2 days ago

Did this article share more than one myth?

The reason programmers don't believe in cache coherency is that they have experienced a closely related phenomenon: memory reordering. This requires you to use a memory fence when accessing a shared value across multiple cores, as if the cores needed to synchronize.
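
For readers who haven't run into this, a minimal C++ sketch of the message-passing pattern being described (names are illustrative): the release/acquire pair, or an explicit fence, is what forbids the reordering, not the cache.

  #include <atomic>
  #include <cassert>

  std::atomic<int>  payload{0};
  std::atomic<bool> ready{false};

  void producer() {
      payload.store(42, std::memory_order_relaxed);
      ready.store(true, std::memory_order_release);   // or a release fence before a relaxed store
  }

  void consumer() {
      while (!ready.load(std::memory_order_acquire))  // or a relaxed load followed by an acquire fence
          ;                                           // spin until the flag is observed
      // With the release/acquire pair this assert cannot fire; with relaxed
      // ordering on both flag accesses, the compiler or the CPU may reorder
      // the two stores or the two loads, and it can.
      assert(payload.load(std::memory_order_relaxed) == 42);
  }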

groundzeros2015 | 2 days ago

I may be completely out of line here, but isn't the story on ARM very very different? I vaguely recall the whole point of having stuff like weak atomics being that on x86, those don't do anything, but on ARM they are essential for cache coherency and memory ordering? But then again, I may just be conflating memory ordering and coherency.

daemontus | 2 days ago

Coherent cache is transparent to the memory model. So if someone trying to explain the memory model and ordering mentioned the cache as affecting it, it was pretty much a sign they didn't fully understand what they were talking about.
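
A small sketch of the distinction being drawn (illustrative C++): coherence alone already gives you a single order of writes to each individual variable, even with relaxed atomics; the memory model only becomes visible once you care about ordering across different variables.

  #include <atomic>

  std::atomic<int> x{0};

  void writer() {
      x.store(1, std::memory_order_relaxed);
      x.store(2, std::memory_order_relaxed);
  }

  void reader() {
      int first  = x.load(std::memory_order_relaxed);
      int second = x.load(std::memory_order_relaxed);
      // Coherence forbids `first == 2 && second == 1`: once a newer value of
      // `x` has been observed, an older one can never be observed again, and
      // no fence is needed for that. What relaxed ordering does NOT give you
      // is any ordering between `x` and some other variable; that is the
      // memory model's job (acquire/release, fences, seq_cst).
      (void)first;
      (void)second;
  }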

j_seigh | 2 days ago

Very interesting. I am hardly an expert, but this gives the impression that if it weren't for that meddling software, we would all live in a synchronous world.

This ignores store buffers, and consequently memory fencing, which is the basis for the nightmarish std::memory_order, the worst API documentation you will ever meet.
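
The classic store-buffering litmus test shows why (a hedged C++ sketch): each core's store can sit in its private store buffer while the following load reads the other variable, so both threads may observe the stale zero, even on x86, unless full fences or seq_cst operations are used.

  #include <atomic>

  std::atomic<int> x{0}, y{0};
  int r1 = 0, r2 = 0;

  void thread_1() {
      x.store(1, std::memory_order_relaxed);
      // std::atomic_thread_fence(std::memory_order_seq_cst);
      r1 = y.load(std::memory_order_relaxed);
  }

  void thread_2() {
      y.store(1, std::memory_order_relaxed);
      // std::atomic_thread_fence(std::memory_order_seq_cst);
      r2 = x.load(std::memory_order_relaxed);
  }

  // After both threads run, r1 == 0 && r2 == 0 is a legal outcome: each
  // store can still be draining from its core's store buffer while the
  // other core performs its load. Uncommenting both seq_cst fences (or
  // making all four accesses seq_cst) rules that outcome out.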

breppp | 2 days ago

Since the CPU is doing cache coherency transparently, perhaps there should be some way for an application to promise it is well-behaved in order to access a lower-level, non-transparent instruction set and manage cache coherency manually from the application level. Or perhaps applications can never be trusted with that level of control over the hardware. The MESI model reminded me of Rust's ownership and borrowing. The pattern also appears in OpenGL vs Vulkan drivers: implicit sync vs explicit sync. Yet another example would be the cache-management work involved in squeezing maximum throughput out of CUDA on an enterprise GPU.
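
Some of those escape hatches already exist on x86, for what it's worth. A hedged sketch of the usual intrinsics (they hint at or bypass the cache, but none of them turns coherence off; the function name is made up):

  #include <immintrin.h>

  // `dst` must be 16-byte aligned for the streaming store.
  void cache_hints_sketch(float* dst, const float* src) {
      _mm_prefetch(reinterpret_cast<const char*>(src), _MM_HINT_T0); // hint: pull `src` into L1
      __m128 v = _mm_loadu_ps(src);
      _mm_stream_ps(dst, v);  // non-temporal store, bypasses the cache hierarchy
      _mm_sfence();           // order the streaming store against later stores
      _mm_clflush(src);       // explicitly evict the line holding `src`
  }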

QuaternionsBhop | 2 days ago

2018, discussed on HN in 2023: https://news.ycombinator.com/item?id=36333034

  Myths Programmers Believe about CPU Caches (2018) (rajivprab.com)
  176 points by whack on June 14, 2023 | 138 comments

yohbho | 3 days ago

The author talks about Java and cache coherency, but ignores the fact that the JVM implementation sits above the processor's caches and, per the spec, can have different caches per thread.

This means you need to synchronize every shared access, whether it's a read or write. In hardware systems you can cheat because usually a write performs a write-through. In a JVM that's not the case.

It's been a long time since I had to think about this, but it bit us pretty hard when we found that.

mannyv | 2 days ago

This post, along with the tutorial links it and the comments contain, provides good insights into caches, coherence, and related matters. I would like to add a link that I feel is also very good, maybe better:

<https://marabos.nl/atomics/hardware.html>

While the book this chapter is from is about Rust, this chapter is pretty much language-agnostic.

wpollock | 2 days ago