An x64 core roughly corresponding to an SM, or in the amdgpu world a compute unit (CU), seems right. It's in the same ballpark for power consumption, and it represents the component handling an instruction pointer, a local register file, and so forth.
A really big CPU is a couple of hundred cores; a big GPU is a few hundred SMs/CUs. Some low power chips are 8 x64 cores and 8 CUs on the same package. It all roughly lines up.

Hi Raph, first of all thank you for all of your contributions and writings - I've learned a ton from reading your blog!
A minor quibble amidst your good comparison above ;)
For a Zen 5 core, we have 16-wide SIMD with 4 pipes: 2 are FMA (2 flops each) and 2 are FADD, at ~5 GHz. I math that out to 16 * 6 * 5 = 480 GFLOPS/core... am I missing something?
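The arithmetic in this comment can be written out as a quick sketch. The function name `peak_gflops_per_core` and its parameters are made up for illustration, and the pipe mix (2 FMA + 2 FADD) and ~5 GHz clock are the figures assumed by the commenter, not measured values.

```python
def peak_gflops_per_core(simd_lanes, fma_pipes, fadd_pipes, clock_ghz):
    """Peak FP32 GFLOPS for one core, counting an FMA as 2 flops and an add as 1."""
    flops_per_lane_per_cycle = 2 * fma_pipes + 1 * fadd_pipes
    return simd_lanes * flops_per_lane_per_cycle * clock_ghz

# Assumed inputs from the comment: 16 lanes, 2 FMA pipes, 2 FADD pipes, ~5 GHz.
zen5_estimate = peak_gflops_per_core(simd_lanes=16, fma_pipes=2, fadd_pipes=2, clock_ghz=5.0)
print(zen5_estimate)  # 480.0
```

Under this simple model, a higher per-core figure would need either more FMA-capable pipes or a higher clock than assumed here.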

For those of us not fluent in codenames: Granite Ridge core = Zen 5 core.

> It's almost certainly best just to compare TFLOPS
Depends on what you're comparing with what, and the context, of course.
Casey is doing education, so that people learn how best to program these devices. A mere comparison of TFLOPS of CPU vs GPU would be useless towards those ends. Similarly, a bare comparison of TFLOPS between different GPUs, even of the same generation, would mask architectural differences in how to achieve those theoretical TFLOPS upper bounds in practice.
I think Casey believes most people don't know how to program well for these devices/architectures. In that context, I think it's appropriate to be almost dismissive of TFLOPS comparison talk.

> It's almost certainly best just to compare TFLOPS - also a bit of a slippery concept, as that depends on the precision
Agreed. Some quibbles about the slipperiness of the concept.
Flops are floating point operations. IMO it should not be confusing at all: just count single precision floating point operations, which all devices can do, and which are explicitly defined in the IEEE standard.
Half precision flops are interesting but should be called out for the non-standard metric they are. Anyone using half precision flops as a flop is either being intentionally misleading or is confused about user expectations.
On the other side, lots of scientific computing folks would rather have doubles, but IMO we should get with the times and learn to deal with less precision. It is fun, you get to make some trade-offs, and you can see if your algorithms are really as robust as you expected. A free 2x speed up even on CPUs is pretty nice.
> and also whether the application can make use of the sparsity feature
Eh, I don’t like it. Flops are flops. Avoiding a computation by exploiting sparsity is not a flop. If we want to take credit for flops not executed via sparsity, there’s a whole ecosystem of mostly-CPU “sparse matrix” codes to consider. Of course, GPUs have this nice 50% sparse feature, but nobody wants to compete against PARDISO or iterative solvers for really sparse problems, right? Haha.

Here's my quick take.
A top of the line Zen core is a powerful CPU with wide SIMD (AVX-512 is 16 lanes of 32 bit quantities), significant superscalar parallelism (capable of issuing approximately 4 SIMD operations per clock), and a high clock rate (over 5 GHz). There isn't a lot of confusion about what constitutes a "core," though multithreading can inflate the "thread" count. See [1] for a detailed analysis of the Zen 5 line.
A single Granite Ridge core has peak 32 bit multiply-add performance of about 730 GFLOPS.
Nvidia, by contrast, uses the marketing term "core" to refer to a single SIMD lane. Their GPUs are organized as 32 SIMD lanes grouped into each "warp," and 4 warps grouped into a Streaming Multiprocessor (SM). CPU and GPU architectures can't be directly compared, but just going by peak floating point performance, the most comparable granularity to a CPU core is the SM. A warp is in some ways more powerful than a CPU core (generally wider SIMD, larger register file, more local SRAM, better latency hiding) but in other ways less (much less superscalar parallelism, lower clock, around 2.5 GHz). A 4090 has 128 SMs, which is a lot and goes a long way to explaining why a GPU has so much throughput. A 1080, by contrast, has 20 SMs - still a goodly number but not mind-meltingly bigger than a high end CPU. See the Nvidia Ada whitepaper [2] for an extremely detailed breakdown of 4090 specs (among other things).
A single Nvidia 4090 "core" has peak 32 bit multiply-add performance of about 5 GFLOPS, while an SM has 640 GFLOPS.
I don't know anybody who counts tensor cores by core count, as the capacity of a "core" varies pretty widely by generation. It's almost certainly best just to compare TFLOPS - also a bit of a slippery concept, as that depends on the precision and also whether the application can make use of the sparsity feature.
I'll also note that not all GPU vendors follow Nvidia's lead in counting individual SIMD lanes as "cores." Apple Silicon, by contrast, uses "core" to refer to a grouping of 128 SIMD lanes, similar to an Nvidia SM. A top of the line M2 Ultra contains 76 such cores, for 9728 SIMD lanes. I found Philip Turner's Metal benchmarks [3] useful for understanding the quantitative similarities and differences between Apple, AMD, and Nvidia GPUs.
[1]: <a href="http://www.numberworld.org/blogs/2024_8_7_zen5_avx512_teardown/" rel="nofollow">http://www.numberworld.org/blogs/2024_8_7_zen5_avx512_teardo...</a>
[2]: <a href="https://images.nvidia.com/aem-dam/Solutions/Data-Center/l4/nvidia-ada-gpu-architecture-whitepaper-V2.02.pdf" rel="nofollow">https://images.nvidia.com/aem-dam/Solutions/Data-Center/l4/n...</a>
[3]: <a href="https://github.com/philipturner/metal-benchmarks">https://github.com/philipturner/metal-benchmarks</a>
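To make the per-lane and per-SM figures in this comment concrete, here is a rough sketch of the arithmetic. The 32 lanes per warp, 4 warps per SM, 128 SMs, ~2.5 GHz clock, and the 76 x 128 Apple figure all come from the comment; the function name `peak_fp32_gflops` is made up for illustration, and the model assumes exactly one multiply-add per lane per cycle.

```python
def peak_fp32_gflops(lanes, clock_ghz):
    """Peak FP32 GFLOPS, assuming one multiply-add (2 flops) per lane per cycle."""
    return lanes * 2 * clock_ghz

LANES_PER_WARP = 32
WARPS_PER_SM = 4
lanes_per_sm = LANES_PER_WARP * WARPS_PER_SM   # 128 lanes, i.e. 128 Nvidia "cores"

per_lane = peak_fp32_gflops(1, 2.5)            # one Nvidia "core" (a single SIMD lane)
per_sm = peak_fp32_gflops(lanes_per_sm, 2.5)   # one Streaming Multiprocessor
whole_4090_tflops = 128 * per_sm / 1000        # 128 SMs, converted to TFLOPS

m2_ultra_lanes = 76 * 128                      # Apple "cores" times lanes per core

print(per_lane, per_sm, whole_4090_tflops, m2_ultra_lanes)
```

The same function applied to a CPU core would undercount or overcount depending on how many SIMD pipes can issue a multiply-add each cycle, which is part of why the comparison is only approximate.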

The answer to the leading question "What’s the difference between a Zen core, a CUDA core, and a Tensor core?" is not covered in Part 1, so you may want to wait if this interests you more than chip layouts.

You can calculate the area of the tensor and raytracing units by measuring and comparing die sizes between the nearest 20-series and 16-series chips. Contrary to the assumptions a lot of people made from the cartoon diagrams, it's actually relatively small: together they make up approximately 18% of the cluster area, and it's below 10% of the chip as a whole. The area is roughly 2/3 tensor unit area and 1/3 raytracing unit area, so RT is around 3% of total chip area and tensor is around 6%.
<a href="https://old.reddit.com/r/hardware/comments/baajes/rtx_adds_195mm2_per_tpc_tensors_125_rt_07/" rel="nofollow">https://old.reddit.com/r/hardware/comments/baajes/rtx_adds_1...</a>
This could have changed somewhat in newer releases, but probably not too drastically, since NVIDIA has never really increased raw ray performance since the 20-series launch. And while there have been a few raytracing features around the edges, raster and cache have been bumped significantly too (notably, Ampere got dual-issue fp32 pipelines... which didn't really work out for NVIDIA that well either!), so honestly there's a reasonable chance it's slightly less in subsequent architectures.
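The split described in this comment is simple enough to check with a few lines. This is only a sketch of the arithmetic: `split_area` is a hypothetical helper, and the 9% whole-chip figure is an assumption chosen to be consistent with "below 10%" and the quoted 6% + 3% breakdown.

```python
def split_area(combined_fraction, tensor_share=2 / 3):
    """Split the combined tensor+RT area fraction into (tensor, rt) parts."""
    tensor = combined_fraction * tensor_share
    rt = combined_fraction - tensor
    return tensor, rt

# 0.09 is an assumed whole-chip figure consistent with "below 10%".
tensor_chip, rt_chip = split_area(0.09)
# 0.18 is the quoted share of the cluster area.
tensor_cluster, rt_cluster = split_area(0.18)

print(f"chip: tensor {tensor_chip:.0%}, RT {rt_chip:.0%}")
print(f"cluster: tensor {tensor_cluster:.0%}, RT {rt_cluster:.0%}")
```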

The L2 really belongs to the core; a comparison without it does not make much sense.
The GPU cores (in the classic sense, i.e. not what NVIDIA names as "cores") also include cache memories and also local memories that are directly addressable.
The only confusion is caused by the fact that first NVIDIA, and then ATI/AMD too, started to use an obfuscated terminology in which they replaced a large number of terms that had been used for decades in the computing literature with other terms.
For maximum confusion, many terms that previously had clear meanings, like "thread" or "core", have been reused with new meanings, and ATI/AMD has invented a set of terms corresponding to those used by NVIDIA but with completely different word choices.
I hate the employees of NVIDIA and ATI/AMD who thought that it was a good idea to replace all the traditional terms without having any reason for this.
The traditional meaning of a thread is that for each thread there exists a distinct program counter, a.k.a. instruction pointer, which is used to fetch and execute instructions from a program stored in memory.
The traditional meaning of a core is that it is a block equivalent to a traditional independent processor, i.e. equivalent to a complete computer minus the main memory and the peripherals.
A core may have only one program counter, when it can execute a single thread at a time, or it may have multiple program counters (with associated register sets) when it can execute multiple threads, using either FGMT (fine-grained multithreading) or SMT (simultaneous multithreading).
The traditional terms were very clear and they have direct correspondents in GPUs, but NVIDIA and AMD use other words for those instead of "thread" and "core", and they reuse the words "thread" and "core" for very different things, for maximum obfuscation.
For instance, NVIDIA uses "warp" instead of "thread", while AMD uses "wavefront" instead of "thread". NVIDIA uses "thread" to designate what was traditionally named the body of a "parallel for", a.k.a. "parallel do", program structure (which, when executed on a GPU or multi-core CPU, is unrolled and distributed over cores, threads and SIMD lanes).
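One way to picture that last point is to take the iterations of a "parallel for" and see where each one lands. The sketch below is a toy model, not how any real driver or scheduler assigns work: it assumes 32 lanes per warp and 4 warps per SM (figures quoted elsewhere in this thread), and `place_iteration` is a hypothetical name.

```python
LANES_PER_WARP = 32   # SIMD lanes per traditional "thread" (NVIDIA: warp)
WARPS_PER_SM = 4      # hardware threads per traditional "core" (NVIDIA: SM), assumed

def place_iteration(i):
    """Map iteration i of a parallel-for to (core, thread, lane) in traditional terms."""
    warp_index, lane = divmod(i, LANES_PER_WARP)
    core, thread = divmod(warp_index, WARPS_PER_SM)
    return core, thread, lane

# Each value of i here is what NVIDIA's terminology would call one "thread".
for i in (0, 31, 32, 127, 128):
    print(i, place_iteration(i))
```

In this model iterations 0 through 31 share one program counter and differ only in lane, which is why the traditional vocabulary would call the whole group a single thread.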

Or maybe a CUDA core versus one of Zen's SIMD ports.

> Each of the tiles on the CPU side is actually a Zen 4 core, complete with its dedicated L2 cache.
Perhaps it could be more interesting to compare without L2 cache.

It was a good read. I wonder what hot takes he'll have in the second part, if any.

To correct the example for the Epyc line: models appear to exist with 1 through 8 cores available, except for 5.

> if the intent was truly to try and max yield then there should be for Ryzen for example good 7 core versions with only 1 core that was found to be defective. Since no 7 core zens exist
There are Zen processors that use 7 cores per CCD, e.g. Epyc 7663, 7453, 9634.
The difference between Ryzen and Epyc is the I/O die. The CCDs are the same, so that's presumably where they go.
Another reason you might not see this on the consumer chips is that they have higher base clocks. If you have a CCD where one core is bad and another isn't exactly bad but can't hit the same frequencies as the other six, it doesn't take a lot of difference before it makes more sense to turn off the slowest core than to lower the base clock for the whole processor. 6 x 4.7GHz is faster than 7 x 4.0GHz, let alone 7 x 2.5GHz.
In theory you could let that one core run at a significantly lower speed than the others, but there is a lot of naive software that will misbehave in that context. Whereas the base clock for the Epyc 9634 is 2.25GHz, because it has twelve 7-core CCDs, so it's nearly 300W, and doesn't want to be nearly 1300W regardless of whether or not most of the cores could do >4GHz.
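The binning trade-off in this comment can be sketched with a crude proxy: active cores times base clock. This assumes perfect scaling across cores and ignores boost behavior, so it is only an illustration of the comparison being made; `aggregate_ghz` is a made-up helper, and the clock values are the ones quoted in the comment.

```python
def aggregate_ghz(cores, base_clock_ghz):
    """Crude throughput proxy: active cores times base clock (assumes perfect scaling)."""
    return cores * base_clock_ghz

options = {
    "6 cores @ 4.7 GHz": aggregate_ghz(6, 4.7),
    "7 cores @ 4.0 GHz": aggregate_ghz(7, 4.0),
    "7 cores @ 2.5 GHz": aggregate_ghz(7, 2.5),
}
best = max(options, key=options.get)

for name, value in options.items():
    print(f"{name}: {value:.1f} core-GHz")
print("best:", best)

epyc_9634_cores = 12 * 7   # twelve 7-core CCDs, as described in the comment
```

The margin between the first two options is small under this proxy, so the conclusion depends heavily on how far the weakest core falls short of the others.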

Current Ryzen and EPYC processors have 8 core CCXs. The 6 core parts used to be as you described, but are now a single CCX. The Zen C dies have two CCXs, but they are still 8 core CCXs, and are always symmetrical in core count.
The big exception is that the new Zen 5 Strix Point chip has a 4 core CCX for the non-C cores. I think the Zen 4 based Z1 has a similar setup, but I don't remember and couldn't quickly find the actual information to confirm.

It would be sort of cool if they could do direct to consumer sales with every core running at whatever its maximum speed is, or turned off if too disrupted. But that's not something you could do through existing distribution channels; everyone presumes a fairly limited number of SKUs.

I would guess that there is a desire to not create too many product tiers. I believe 6 core parts are made from two 3-core CCXs (rather than 4 and 2), so only one core is disabled per CCX.

That first sentence is a spectacular non-sequitur.

I refused to buy the chips so determined to be defective, even if they represented better value, because if the intent was truly to try and max yield then there should be for Ryzen for example good 7 core versions with only 1 core that was found to be defective. Since no 7 core zens exist, at least some of the CPUs with 6 core CCDs have intentionally had 1 of the cores destroyed for reasons unknown, which could be to meet volume targets. If this is because on Ryzen the cores can only be disabled in pairs, then it boggles my mind that, given the price difference of tens to hundreds of dollars between the 6 and 8 core versions, it would not be economical to add the circuits to allow each core to be individually fused off and allow further product differentiation, especially considering how much effort and how many SKUs have gone into frequency binning on AM4 (5700x, 5800, 5800x, 5800xt, etc.), rather than bigger market segmentation jumps.

Zen, CUDA, and Tensor Cores, Part I: The Silicon