How to optimize a CUDA matmul kernel for cuBLAS-like performance (2022)

mpweiher | 103 points

Another point to consider here is that this project of writing a cuBLAS-level GEMM kernel becomes much more challenging if you are doing it in fp16, and are thus competing with the cuBLAS kernels that use tensor cores. The (theoretical) arithmetic throughput of tensor cores is ~8x higher than fp32 math on the Turing arch; I don't know off the top of my head, but I think this ratio is the same or greater for Ampere/Hopper tensor cores.
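For reference, the simplest on-ramp to the tensor cores from CUDA C++ is the wmma API (cuBLAS-level kernels use raw mma PTX instead, but this shows what the hardware unit consumes). A minimal sketch, one warp doing a single 16x16x16 fp16 tile multiply with fp32 accumulate:

    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    // One warp computes one 16x16 tile of C += A*B on a tensor core.
    // fp16 inputs, fp32 accumulator: the mixed-precision setup being
    // benchmarked against. Launch <<<1, 32>>>, build with sm_70+.
    __global__ void wmma_tile(const half *A, const half *B, float *C) {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

        wmma::fill_fragment(acc, 0.0f);
        wmma::load_matrix_sync(a, A, 16);   // leading dimension 16
        wmma::load_matrix_sync(b, B, 16);
        wmma::mma_sync(acc, a, b, acc);     // one tensor core op
        wmma::store_matrix_sync(C, acc, 16, wmma::mem_row_major);
    }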

This makes the project proportionally harder, in my opinion, because you need to be that much more efficient at moving data through the memory hierarchy. With tensor cores, to get anywhere close to cuBLAS you need to start from something like the most efficient kernel in Simon's article, and then add shared memory swizzling, async global memory copies, double buffering, and a really efficient kernel epilogue to accumulate the C matrix into the product.
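To make the double-buffering and async-copy points concrete, here is a hypothetical sketch of the main-loop structure using the cp.async intrinsics from <cuda_pipeline.h> (the tile shape, the 8-halves-per-thread copy, and the stand-in compute loop are all illustrative, not from the article):

    #include <cuda_fp16.h>
    #include <cuda_pipeline.h>

    // Double-buffered main loop: while the warps do math on one shared
    // memory stage, cp.async fills the other directly from global memory.
    // Launch with 128 threads; each copies one 16-byte chunk per stage.
    __global__ void mainloop_sketch(const half *gA, float *out, int kTiles) {
        constexpr int ELEMS = 128 * 8;           // halves per stage
        __shared__ half smem[2][ELEMS];
        const int t = threadIdx.x;

        // Prefetch stage 0 before the loop.
        __pipeline_memcpy_async(&smem[0][t * 8], &gA[t * 8], 16);
        __pipeline_commit();

        float acc = 0.0f;
        for (int k = 0; k < kTiles; ++k) {
            const int cur = k & 1, nxt = cur ^ 1;
            if (k + 1 < kTiles) {                // overlap next copy w/ math
                __pipeline_memcpy_async(&smem[nxt][t * 8],
                        &gA[(size_t)(k + 1) * ELEMS + t * 8], 16);
                __pipeline_commit();
            }
            __pipeline_wait_prior(k + 1 < kTiles ? 1 : 0);
            __syncthreads();                     // stage[cur] visible to all
            for (int i = 0; i < 8; ++i)          // stand-in for the mma loop
                acc += __half2float(smem[cur][t * 8 + i]);
            __syncthreads();                     // done reading before refill
        }
        out[blockIdx.x * blockDim.x + t] = acc;
    }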

I came across this article a while ago and it inspired me to take a stab at this^. As of now I have gotten to ~80% of the cuBLAS tensor core performance, at which point the kernel is mostly compute bound, and I am close to giving up on the last ~20%: I think I may need to write the inner loop in SASS to make sure the instruction mix between shared memory loads, mma instructions, and synchronizations is perfectly balanced so that none of the hardware pipelines get overloaded (see the link below), and I have enough compassion for myself not to spend my free time doing stuff like that :). There are also certain things implemented in CUTLASS that seem important (look up serpentine traversal), but NVIDIA engineers won't talk about the hardware details required to understand why this helps.
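On serpentine traversal, the basic idea is to remap flat block IDs onto the grid of C tiles in snake order, so consecutively scheduled blocks touch overlapping rows/columns of A and B and hit in L2. A hypothetical sketch (CUTLASS's actual threadblock swizzle is group-based and more elaborate):

    // Map a flat block id to a (tileM, tileN) coordinate in snake order:
    // walk down one column of C tiles, then back up the next, so adjacent
    // block ids share A/B tiles that are likely still resident in L2.
    __device__ void serpentine_tile(int bid, int tilesM,
                                    int *tileM, int *tileN) {
        int n = bid / tilesM;           // which column of C tiles
        int m = bid % tilesM;           // position within that column
        if (n & 1) m = tilesM - 1 - m;  // reverse odd columns: bottom-up
        *tileM = m;
        *tileN = n;
    }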

An article on this is forthcoming.

https://github.com/NervanaSystems/maxas/wiki/SGEMM

aaa370 | a month ago

Note that this is from 2022.

My guess is that people nowadays are gradually moving away from raw CUDA programming towards things like Triton, and you won't be focusing on a pure GEMM anyway, since you tend to fuse other operations into the kernel.
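As a toy illustration of the fusion point (the names and ops here are mine, not from the Triton tutorial): fusing, say, a bias-add and ReLU into the GEMM epilogue saves a full round trip of C through global memory compared to launching a separate elementwise kernel afterwards.

    // Toy fused epilogue: applied per output element while the fp32
    // accumulator is still in registers, instead of writing C out and
    // reading it back in a second elementwise kernel.
    __device__ __forceinline__ float bias_relu(float acc, float bias) {
        float v = acc + bias;           // fused bias add
        return v > 0.0f ? v : 0.0f;     // fused ReLU
    }

    // ... at the end of the GEMM kernel, per output element:
    // C[idx] = bias_relu(acc, bias[col]);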

The Triton matmul tutorial claims performance on par with cuBLAS.

https://triton-lang.org/main/getting-started/tutorials/03-ma...

flakiness | a month ago

In another discussion here, people asserted that a CUDA replacement was unworkable because you couldn't replace Nvidia's cuBLAS implementation. I'm not qualified to say whether this article gives enough information to construct an adequate replacement, but I'd be interested in people's opinions.

joe_the_user | a month ago