AI’s compute fragmentation: what matrix multiplication teaches us

tzhenghao | 122 points

There's hope in intermediate representations, in OpenXLA:

https://opensource.googleblog.com/2023/03/openxla-is-ready-t...

> OpenXLA is an open source ML compiler ecosystem co-developed by AI/ML industry leaders including Alibaba, Amazon Web Services, AMD, Apple, Arm, Cerebras, Google, Graphcore, Hugging Face, Intel, Meta, and NVIDIA. It enables developers to compile and optimize models from all leading ML frameworks for efficient training and serving on a wide variety of hardware

BenoitP | a year ago

> Hand-written assembly kernels don’t scale!

I used to think this. And I think, in theory, it is true. But the fact of the matter is, modern ML just doesn't use that many kernels. Every framework uses the same libraries (BLAS) and every library uses the same basic idea (maximally saturate FMA-like units).

Large language models are being run natively on commodity hardware with code written from scratch within days of their release (e.g. llama.cpp).

From a conceptual standpoint, it's really easy to saturate hardware in this domain. It's been pretty easy since 2014, when convolutions started being interpreted as matrix multiplications. Sure, the actual implementations can be tricky, but a single engineer (trained in it) can get that done for a specific piece of hardware in a couple of months.
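That conv-as-matmul trick is usually called im2col. Roughly, it looks like this (a minimal single-channel, stride-1, no-padding sketch in C++, not any particular framework's code):

    #include <vector>
    #include <cstddef>

    // im2col: each output pixel becomes a row of KH*KW input values, so the
    // convolution turns into one (OH*OW) x (KH*KW) by (KH*KW) x OC matmul.
    std::vector<float> im2col(const std::vector<float>& img, int H, int W,
                              int KH, int KW) {
        const int OH = H - KH + 1, OW = W - KW + 1;
        std::vector<float> cols(static_cast<std::size_t>(OH) * OW * KH * KW);
        for (int oy = 0; oy < OH; ++oy)
            for (int ox = 0; ox < OW; ++ox)
                for (int ky = 0; ky < KH; ++ky)
                    for (int kx = 0; kx < KW; ++kx)
                        cols[((oy * OW + ox) * KH + ky) * KW + kx] =
                            img[(oy + ky) * W + (ox + kx)];
        return cols;
    }

Once it's a GEMM, every existing hand-tuned kernel applies, which is exactly why saturating the hardware got easy.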

Of course, the interesting problem is how to generalize kernel generation. I spent years working with folks trying to do just that. But, in retrospect, the actual value add from a system that does all this for you is quite low. It's a realization I've been struggling to accept :'(

brrrrrm | a year ago

> "Think about it: how can a small number of specialized experts, who hand write and tune assembly code, possibly scale their work to all the different configurations while also incorporating their work into all the AI frameworks?! It’s simply an impossible task."

By committing it to a common library that a lot of people use? There are already multiple libraries with optimized matrix multiplication.

This is also exaggerating the expertise required. I'm not going to claim it's trivial, but you can genuinely google "intel avx-512 matrix multiplication" and find both papers and Intel samples.
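And the core of what you find is a pretty small pattern. A minimal sketch of the inner loop (AVX-512 intrinsics, compile with -mavx512f; no blocking or tail handling, just the FMA part):

    #include <immintrin.h>

    // Computes 16 consecutive elements of one row of C += A_row * B for
    // row-major matrices: broadcast one A element, load 16 B elements, FMA.
    // Real kernels add register blocking, cache tiling and edge handling.
    void row_kernel_avx512(const float* A_row, const float* B, float* C,
                           int K, int ldb) {
        __m512 acc = _mm512_loadu_ps(C);               // 16 partial sums
        for (int k = 0; k < K; ++k) {
            __m512 a = _mm512_set1_ps(A_row[k]);       // broadcast A[i][k]
            __m512 b = _mm512_loadu_ps(B + k * ldb);   // B[k][j..j+15]
            acc = _mm512_fmadd_ps(a, b, acc);          // acc += a * b
        }
        _mm512_storeu_ps(C, acc);
    }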

nitwit005 | a year ago

> "Think about it: how can a small number of specialized experts, who hand write and tune assembly code, possibly scale their work to all the different configurations while also incorporating their work into all the AI frameworks?! It’s simply an impossible task."

Naively, I wonder if this is the kind of problem that AI itself can solve, which is a rather singularity-approaching concept. Maybe there's too much logic involved and not enough training data on different configurations for that to work? A bit spooky, however, the thought of self-bootstrapping AI.

photochemsyn | a year ago

My take: optimizing matrix multiplication is not hard on modern architectures if you have the right abstraction. The code itself may be fragmented across different programming models, true, but the underlying techniques are not hard for a 2nd/3rd-year undergrad to understand. There are only a few important ones on GPU: loop tiling, pipelining, shared memory swizzling, and memory coalescing. A properly designed compiler can let developers optimize matmuls within 100 lines of code.
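For example, loop tiling is just restructuring the loops so a block of the matrices stays resident in fast memory. A CPU-cache sketch of my own (on a GPU the same tile would be staged into shared memory by a thread block):

    #include <algorithm>

    constexpr int BT = 64;  // tile size, tuned per architecture

    // C += A * B for row-major N x N matrices, processed tile by tile so
    // each BT x BT block is reused from cache instead of re-fetched from DRAM.
    void matmul_tiled(const float* A, const float* B, float* C, int N) {
        for (int i0 = 0; i0 < N; i0 += BT)
            for (int k0 = 0; k0 < N; k0 += BT)
                for (int j0 = 0; j0 < N; j0 += BT)
                    for (int i = i0; i < std::min(i0 + BT, N); ++i)
                        for (int k = k0; k < std::min(k0 + BT, N); ++k) {
                            const float a = A[i * N + k];
                            for (int j = j0; j < std::min(j0 + BT, N); ++j)
                                C[i * N + j] += a * B[k * N + j];
                        }
    }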

junrushao1994 | a year ago

The article seems to be missing a conclusion.

Writing assembly doesn’t scale across lots of platforms? Sure… the solution for matrix multiplication is to use the vendor’s BLAS.
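And "use the vendor's BLAS" really is just one call through the CBLAS interface that OpenBLAS, MKL and BLIS all ship; a minimal sketch:

    #include <cblas.h>  // from OpenBLAS / MKL / BLIS

    // C = A * B for row-major matrices: A is M x K, B is K x N, C is M x N.
    // The vendor library picks the tuned kernel for the machine underneath.
    void gemm(const float* A, const float* B, float* C, int M, int N, int K) {
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    M, N, K,
                    1.0f, A, K,   // alpha, A, lda
                          B, N,   // B, ldb
                    0.0f, C, N);  // beta, C, ldc
    }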

If the vendor can't at least plop some kernels into BLIS, they don't want you to use their platform for matmuls… don't fight them.

bee_rider | a year ago

I really like the Neanderthal library because it does a pretty good job of abstracting over Nvidia, AMD, and Intel hardware, providing matrix operations in an extremely performant manner on each one with the same code. Dragan goes into a lot of detail about the hardware differences. His library also provides some of the fastest implementations for the given hardware; it's not a hand-wavy, half-baked performance abstraction, the code is really fast. https://github.com/uncomplicate/neanderthal

gleenn | a year ago

Surely one solution is for each AI framework to understand the operating environment and choose the best implementation at run-time, much as they currently do.
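Which boils down to runtime dispatch on hardware features. A minimal CPU-side sketch using GCC/Clang's __builtin_cpu_supports (the three kernels are hypothetical placeholders):

    // Hypothetical kernels specialized per ISA (declarations only).
    void matmul_avx512(const float*, const float*, float*, int);
    void matmul_avx2(const float*, const float*, float*, int);
    void matmul_generic(const float*, const float*, float*, int);

    using MatmulFn = void (*)(const float*, const float*, float*, int);

    // Probe the machine once at startup and pick the best implementation.
    MatmulFn select_matmul() {
        if (__builtin_cpu_supports("avx512f")) return matmul_avx512;
        if (__builtin_cpu_supports("avx2"))    return matmul_avx2;
        return matmul_generic;
    }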

bigbillheck | a year ago

Yeah well tell all that to Nvidia, who very much likes the fragmentation and wants to keep things that way.

brucethemoose2 | a year ago

> performance has become increasingly constrained by memory latency, which has grown much slower than processing speeds.

Sounds like they would oddly prefer memory latency to grow at least as fast as processing speeds, which would be terrible. Obviously, memory latency actually decreased, just not enough.

So it seems likely they made a mistake and actually meant that memory latency has decreased more slowly than processing speeds have increased. In other words, it is not memory latency but memory random-access throughput (which, to a rough approximation, is proportional to the inverse of memory latency) that has grown much more slowly than processing speeds.
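Rough back-of-the-envelope version of that parenthetical, for a stream of dependent random accesses:

    random-access throughput ≈ (outstanding accesses) / latency

so with little memory-level parallelism it really is just the inverse of latency.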

EntrePrescott | a year ago

Chad Jarvis is an AI-generated name if I’ve ever heard one

b34r | a year ago

A cool mission

version_five | a year ago

No, they compute spectra.

adamnemecek | a year ago