AMD now has more compute on the top 500 than Nvidia

rbanffy | 195 points

As someone who worked in the ML infra space: Google, Meta, xAI, Oracle, Microsoft, and Amazon all have clusters that outperform the highest-performing cluster on the Top500. They don't submit because there's no reason to, and some want to keep the size of their clusters secret. They're all running Nvidia. (Except Google, which uses both TPUs and Nvidia.)

> El Capitan – we don’t yet know how big of a portion yet as we write this – with 43,808 of AMD’s “Antares-A” Instinct MI300A devices

By comparison, xAI announced that they have 100k H100s. The MI300A and H100 have roughly similar performance. Meta says they're training on more than 100k H100s for Llama-4, and have the equivalent of 600k H100s' worth of compute. (Note that compute and networking can be orthogonal.)
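To put those counts side by side, here is a rough back-of-the-envelope sketch in Python. It only uses the device counts quoted above and takes the "roughly similar performance" claim at face value; the `mi300a_per_h100` factor of 1.0 is an assumption rather than a measured number, the function name is made up for illustration, and networking is ignored entirely.

```python
# Device counts as quoted in this thread.
EL_CAPITAN_MI300A = 43_808   # MI300A devices in El Capitan
XAI_H100 = 100_000           # H100s announced by xAI
META_H100_EQUIV = 600_000    # Meta's stated H100-equivalent compute


def ratio_vs_el_capitan(h100_count, mi300a_per_h100=1.0):
    """Return how many El Capitans' worth of accelerators a cluster has.

    mi300a_per_h100 is an ASSUMED conversion factor: 1.0 means one H100
    is treated as equal to one MI300A ("roughly similar performance").
    Interconnect and real benchmark efficiency are not modeled.
    """
    return h100_count * mi300a_per_h100 / EL_CAPITAN_MI300A


if __name__ == "__main__":
    print(f"xAI:  {ratio_vs_el_capitan(XAI_H100):.1f}x El Capitan's device count")
    print(f"Meta: {ratio_vs_el_capitan(META_H100_EQUIV):.1f}x El Capitan's device count")
```

Under that one-to-one assumption, the xAI figure comes out to a bit over twice El Capitan's accelerator count. Changing `mi300a_per_h100` shows how sensitive the comparison is to the per-device performance guess.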

Also, Nvidia B200s are rolling out now. They offer 2-3x the performance of H100s.

ipsum2 | 4 days ago

After skimming the article, I'm confused -- where exactly is the headline being pulled from?

If you look at the table toward the bottom, no matter how you slice it, Nvidia has 50% of the total cores, 50% of the total flops, and 90% of the total systems among the Top 500, while AMD has 26% of the total cores, 27.5% of the total flops, and 7% of the total systems.

Is it a matter of newly-added compute?

> This time around, on the November 2024 Top500 rankings, AMD is the big winner in terms of adding capacity to the HPC base.

vitus | 4 days ago

I'm sure there is also a lot that's not on the Top500. I've got enough AMD MI300X compute for about 140th position, but haven't submitted numbers.

latchkey | 4 days ago

There is another common factor among the top machines. A large majority are based on HPE Slingshot networking (7 of the top 10 by my count).

Without blindingly fast networking, otherwise blinding numerical performance dims quite a lot. This is why the Cerebras numbers on heavy numerical problems are competitive up to a pretty severe ceiling. Below that point, their on-wafer interconnects suffice; above it, they cannot scale the data communications bandwidth necessary.

ted_dunning | 3 days ago

Layperson with no industry knowledge, but it seems like Nvidia's CUDA moat will fall in the next 2-5 years. It seems impossible to sustain those margins without competition coming in and getting a decent slice of the pie.

pie420 | 4 days ago

Why the focus on AMD and Nvidia? It really isn't that hard to design a large number of ALU blocks into some silicon IP block and make them work together efficiently.

The real accomplishment is fabricating them.

amelius | 4 days ago

It does not matter. AMD is shit when it comes to low-level processing; their algos are stuck and go nowhere. Nvidia is killing it. There is a reason why Zuckerberg ordered billions in GPUs from Nvidia and not from AMD.

nwgo | 4 days ago