Hardware Acceleration of LLMs: A comprehensive survey and comparison

matt_d | 266 points

This paper is light on background so I’ll offer some additional context:

As early as the '90s it was observed that CPU speed (FLOPS) was improving faster than memory bandwidth. In 1995, William Wulf and Sally McKee predicted this divergence would lead to a "memory wall", where most computations would be bottlenecked by data access rather than by arithmetic operations.

Over the past 20 years, peak server hardware FLOPS has been scaling at 3x every 2 years, outpacing the growth of DRAM and interconnect bandwidth, which have scaled at only 1.6x and 1.4x every 2 years, respectively.
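Compounding those rates over 20 years (10 two-year periods) shows how quickly the gap opens up; a quick back-of-the-envelope in Python:

```python
# Compound the scaling rates quoted above over 20 years (10 two-year periods).
periods = 20 // 2

flops_growth        = 3.0 ** periods   # ~59,000x
dram_bw_growth      = 1.6 ** periods   # ~110x
interconnect_growth = 1.4 ** periods   # ~29x

print(f"peak FLOPS:             ~{flops_growth:,.0f}x")
print(f"DRAM bandwidth:         ~{dram_bw_growth:,.0f}x")
print(f"interconnect bandwidth: ~{interconnect_growth:,.0f}x")
print(f"FLOPS vs. DRAM BW gap:  ~{flops_growth / dram_bw_growth:,.0f}x wider")
```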

Thus for training and inference of LLMs, the performance bottleneck is increasingly shifting toward memory bandwidth. Particularly for autoregressive Transformer decoder models, it can be the dominant bottleneck.
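A rough roofline-style check makes this concrete. The model size and hardware numbers below are illustrative assumptions (a 7B-parameter fp16 model on a hypothetical accelerator), not figures from the paper:

```python
# Back-of-the-envelope roofline check for batch-1 autoregressive decoding.
params        = 7e9          # assumed 7B-parameter model
bytes_per_w   = 2            # fp16 weights
flops_per_tok = 2 * params   # ~1 multiply + 1 add per weight per token
bytes_per_tok = params * bytes_per_w   # every weight is streamed in once per token

intensity = flops_per_tok / bytes_per_tok   # ~1 FLOP per byte moved

peak_flops = 1e15   # assumed 1 PFLOP/s of fp16 compute
peak_bw    = 3e12   # assumed 3 TB/s of HBM bandwidth
ridge      = peak_flops / peak_bw           # ~333 FLOP/byte needed to be compute-bound

print(f"decode arithmetic intensity: {intensity:.1f} FLOP/byte")
print(f"hardware ridge point:        {ridge:.0f} FLOP/byte")
# intensity << ridge point, so batch-1 decoding is limited by memory bandwidth:
# tokens/s is roughly peak_bw / bytes_per_tok, not peak_flops / flops_per_tok.
```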

This is driving the need for new technology like compute-in-memory (CIM), also known as processing-in-memory (PIM): hardware in which operations are performed directly on data in memory, rather than first transferring it to CPU registers. This improves latency and power consumption, and possibly sidesteps the great "memory wall".

Notably, to compare ASIC and FPGA hardware across varying semiconductor process sizes, the paper uses a fitted polynomial to extrapolate to a common denominator of 16 nm:

> Based on the article by Aaron Stillmaker and B. Baas titled "Scaling equations for the accurate prediction of CMOS device performance from 180 nm to 7nm," we extrapolated the performance and the energy efficiency on a 16nm technology to make a fair comparison

But no such extrapolation is done for the CIM/PIM accelerators, because the authors claim:

> As the in-memory accelerators the performance is not based only on the process technology, the extrapolation is performed only on the FPGA and ASIC accelerators where the process technology affects significantly the performance of the systems.

This strikes me as an odd claim at face value, but perhaps others here could offer further insight into that decision.
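For what it's worth, the normalization step itself is conceptually simple; here is a rough sketch in Python, where the scale factors are placeholders rather than the fitted values from Stillmaker & Baas:

```python
# Sketch of the normalization described above: scale each accelerator's reported
# throughput and energy efficiency from its native process node to a common
# 16 nm node. The factors below are made up; the real ones come from the fitted
# polynomial scaling equations in Stillmaker & Baas.
HYPOTHETICAL_PERF_SCALE   = {45: 2.4, 28: 1.6, 16: 1.0}  # placeholder speedup to 16 nm
HYPOTHETICAL_ENERGY_SCALE = {45: 3.5, 28: 2.1, 16: 1.0}  # placeholder efficiency gain

def normalize_to_16nm(throughput_gops, energy_gops_per_w, node_nm):
    """Project numbers reported at `node_nm` onto a 16 nm process."""
    return (throughput_gops * HYPOTHETICAL_PERF_SCALE[node_nm],
            energy_gops_per_w * HYPOTHETICAL_ENERGY_SCALE[node_nm])

# e.g. an FPGA design reported at 28 nm (numbers made up for illustration):
t16, e16 = normalize_to_16nm(throughput_gops=120, energy_gops_per_w=0.8, node_nm=28)
print(f"~{t16:.0f} GOPS, ~{e16:.1f} GOPS/W projected to 16 nm (illustrative only)")
```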

Links below for further reading.

https://arxiv.org/abs/2403.14123

https://en.m.wikipedia.org/wiki/In-memory_processing

http://vcl.ece.ucdavis.edu/pubs/2017.02.VLSIintegration.Tech...

refibrillator | 4 months ago

I've always been partial to systolic arrays. I iterated through a bunch of options over the past few decades and settled on what I think is the optimal solution: a Cartesian grid of cells.

Each cell would have 4 input bits, one from each neighbor, and 4 output bits, again one to each neighbor. In the middle would be 64 bits of shift register from a long scan chain, whose outputs feed four 16:1 multiplexers, plus 4 bits of latch.

Through the magic of graph coloring, a checkerboard pattern would be used to clock all of the cells to allow data to flow in any direction without preference, and without race conditions. All of the inputs to any given cell would be stable.

This allows the flexibility of an FPGA, without the need to worry about timing issues or race conditions, glitches, etc. This also keeps all the lines short, so everything is local and fast/low power.

What it doesn't do is use gates efficiently or give the fastest path for logic. Every single operation happens effectively in parallel; all computation is pipelined.
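Here's a minimal Python sketch of one way such a cell could be simulated (the LUT split, direction ordering, and two-phase grid driver here are illustrative guesses from the description above, not a spec):

```python
# Each cell's 64 config bits are treated as four 16-entry lookup tables (one per
# output direction), each indexed by the 4 neighbor-input bits, i.e. a 16:1 mux.
# The 4-bit latch holds the outputs; checkerboard clocking updates "black" and
# "white" cells on alternate phases so inputs are always stable.

class Cell:
    def __init__(self, config_bits):
        assert len(config_bits) == 64
        # Four 16-entry LUTs, one per output direction (N, E, S, W).
        self.luts = [config_bits[i * 16:(i + 1) * 16] for i in range(4)]
        self.latch = [0, 0, 0, 0]   # latched output bits (N, E, S, W)

    def clock(self, n, e, s, w):
        sel = (n << 3) | (e << 2) | (s << 1) | w   # 4 neighbor bits -> mux select
        self.latch = [lut[sel] for lut in self.luts]

def step(grid, phase):
    """Clock only cells whose (row + col) parity matches `phase`, so every cell
    latches while its opposite-color neighbors hold stable outputs."""
    rows, cols = len(grid), len(grid[0])
    for r in range(rows):
        for c in range(cols):
            if (r + c) % 2 != phase:
                continue
            n = grid[r - 1][c].latch[2] if r > 0 else 0         # north neighbor's S output
            s = grid[r + 1][c].latch[0] if r + 1 < rows else 0  # south neighbor's N output
            w = grid[r][c - 1].latch[1] if c > 0 else 0         # west neighbor's E output
            e = grid[r][c + 1].latch[3] if c + 1 < cols else 0  # east neighbor's W output
            grid[r][c].clock(n, e, s, w)

# e.g. a 4x4 grid of identically configured cells, clocked for one full cycle:
grid = [[Cell([0] * 64) for _ in range(4)] for _ in range(4)]
step(grid, 0)
step(grid, 1)
```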

I've had this idea since about 1982... I really wish someone would pick it up and run with it. I call it the BitGrid.

mikewarot | 4 months ago

Related:

https://arxiv.org/pdf/2406.08413 Memory Is All You Need: An Overview of Compute-in-Memory Architectures for Accelerating Large Language Model Inference

fulafel | 4 months ago

I'd love to watch an LLM run in WebGL where everything is textures. It would be neat to visually see the difference between architectures.

koolala | 4 months ago

Memory movement is the bottleneck these days, hence the expensive HBM. Nvidia's designs are also memory-optimized, since memory is the true bottleneck both chip-wise and system-wise.

synergy20 | 4 months ago

Could an FPGA + ASIC + in-memory hybrid architecture have any role to play in scaling/flexibility? Each has its own benefits (e.g., FPGAs for flexibility, ASICs for performance, in-memory for energy efficiency), so could a hybrid approach integrating them juice LLM performance even further?

next_xibalba | 4 months ago

In-memory sounds like the way to go, not just in terms of performance, but also because it makes no sense to build an ASIC or program an FPGA for a model that will most likely be obsolete in a few months, if you're lucky.

moffkalast | 4 months ago

There was a paper about an LLM running on the same power as a light bulb.

https://arxiv.org/abs/2406.02528

https://news.ucsc.edu/2024/06/matmul-free-llm.html

smusamashah | 4 months ago

I'm unfamiliar; in this context is "in-memory" specialized hardware that combines CPU+RAM?

yjftsjthsd-h | 4 months ago

Is there a "nice" way to read content on arXiv?

Every time I land on that site I'm so confused / lost in its interface (or lack thereof) that I usually end up leaving without getting to the content.

smcleod | 4 months ago

Curious if anyone is making AccelTran ASICs?

jumploops | 4 months ago

The values (namely those for the FPGAs) should also have been normalized by price.

DrNosferatu | 4 months ago

This explains the success of Groq's ASIC-powered LPUs. LLM inference on Groq Cloud is blazingly fast. Also, the reduction in energy consumption is nice.

fsndz | 4 months ago