Did this article share more than one myth?
The reason programmers don’t believe in cache coherency is that they have experienced a closely related phenomenon: memory reordering. That requires you to use a memory fence when accessing a value shared between multiple cores - as if the caches needed to be synchronized.
I may be completely out of line here, but isn't the story on ARM very very different? I vaguely recall the whole point of having stuff like weak atomics being that on x86, those don't do anything, but on ARM they are essential for cache coherency and memory ordering? But then again, I may just be conflating memory ordering and coherency.
Coherent cache is transparent to the memory model. So if someone trying to explain the memory model and ordering mentioned caches as affecting the memory model, it was pretty much a sign they didn't fully understand what they were talking about.
Very interesting. I am hardly an expert, but this gives the impression that if it weren't for that meddling software, we would all be living in a synchronous world.
This ignores store buffers and, consequently, memory fencing, which is the basis for the nightmarish std::memory_order - the worst API documentation you will ever meet.
Since the CPU is doing cache coherency transparently, perhaps there should be some way for an application to promise it is well-behaved in order to access a lower-level, non-transparent instruction set and manage cache coherency manually from the application level. Or perhaps applications can never be trusted with that level of control over the hardware.

The MESI model reminded me of Rust's ownership and borrowing. The pattern also appears in OpenGL vs Vulkan drivers: implicit sync vs explicit sync. Yet another example would be the cache management work involved in squeezing maximum throughput out of CUDA on an enterprise GPU.
2018, discussed on HN in 2023: https://news.ycombinator.com/item?id=36333034
Myths Programmers Believe about CPU Caches (2018) (rajivprab.com)
176 points by whack on June 14, 2023 | 138 comments

The author talks about Java and cache coherency, but ignores the fact that the JVM implementation sits above the processor's caches and, per the spec, can have different caches per thread.
This means you need to synchronize every shared access, whether it's a read or write. In hardware systems you can cheat because usually a write performs a write-through. In a JVM that's not the case.
It's been a long time since I had to think about this, but it bit us pretty hard when we ran into it.
This post, along with the tutorial links it and the comments contain, provides good insights on caches, coherence, and related topics. I would like to add a link that I feel is also very good, maybe better:
<https://marabos.nl/atomics/hardware.html>
While the book this chapter is from is about Rust, this chapter is pretty much language-agnostic.
Here's my favorite practically applicable cache-related fact: even on x86, on recent server CPUs, cache-coherency protocols may operate at a different granularity than the cache line size. A typical case with new Intel server CPUs is operating at the granularity of 2 consecutive cache lines. Some thread-pool implementations, like Crossbeam in Rust and my ForkUnion in Rust and C++, explicitly document that and align objects to 128 bytes [1].
As those docstrings note, using STL's `std::hardware_destructive_interference_size` won't help you. On ARM, this issue becomes even more pronounced, so concurrency-heavy code should ideally be compiled multiple times for different coherence protocols and leverage "dynamic dispatch", similar to how I and others handle SIMD instructions in libraries that need to run on a very diverse set of platforms.

[1] https://github.com/ashvardanian/ForkUnion/blob/46666f6347ece...