I've always been partial to systolic arrays. I've iterated through a bunch of options over the past few decades and settled on what I think is the optimal solution: a Cartesian grid of cells.
Each cell would have 4 input bits, one from each neighbor, and 4 output bits, again one to each neighbor. In the middle would be 64 bits of shift register from a long scan chain, the output of which feeds four 16:1 multiplexers, plus 4 bits of latch.
Through the magic of graph coloring, a checkerboard pattern would be used to clock the cells, letting data flow in any direction without preference and without race conditions: all of the inputs to any given cell would be stable when it latches.
This gives the flexibility of an FPGA without the need to worry about timing closure, race conditions, glitches, etc. It also keeps all the wires short, so everything is local, fast, and low power.
What it doesn't do is use gates efficiently, nor give the fastest path for any given piece of logic. Every operation happens effectively in parallel, and all computation is pipelined.
I've had this idea since about 1982... I really wish someone would pick it up and run with it. I call it the BitGrid.
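To make that concrete, here's a minimal Python sketch of one cell and the checkerboard clocking as I read the description above (not a reference implementation); `inputs_of` is a placeholder for however the neighbors' latched outputs get gathered:

    # Minimal sketch of one cell, as I read the description above (not a
    # reference implementation): the 64 config bits act as four 16-entry
    # lookup tables, the 4 neighbor inputs form the index, and the 4
    # selected bits are latched as outputs, one per neighbor.
    class Cell:
        def __init__(self, config_bits):        # 64 bits shifted in via the scan chain
            assert len(config_bits) == 64
            # One 16:1 multiplexer (LUT) per output direction: N, E, S, W.
            self.lut = [config_bits[i * 16:(i + 1) * 16] for i in range(4)]
            self.out = [0, 0, 0, 0]             # latched outputs, one to each neighbor

        def update(self, n, e, s, w):
            idx = (n << 3) | (e << 2) | (s << 1) | w   # 4 input bits -> 16-way select
            self.out = [self.lut[d][idx] for d in range(4)]

    # Checkerboard clocking: update all cells of one color, then the other,
    # so every cell sees stable inputs from its opposite-color neighbors.
    def step(grid, inputs_of, phase):           # grid: {(x, y): Cell}, phase: 0 or 1
        for (x, y), cell in grid.items():
            if (x + y) % 2 == phase:
                cell.update(*inputs_of(grid, x, y))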
Related:
https://arxiv.org/pdf/2406.08413 Memory Is All You Need: An Overview of Compute-in-Memory Architectures for Accelerating Large Language Model Inference
I'd love to watch an LLM run in WebGL where everything is textures. It would be neat to see the differences between architectures visually.
Memory movement is the bottleneck these days, hence the expensive HBM. Nvidia's designs are also memory-optimized, since memory is the true bottleneck both chip-wise and system-wise.
Could an FPGA + ASIC + in-memory hybrid architecture have any role to play in scaling/flexibility? Each one has its own benefits (e.g., FPGAs for flexibility, ASICs for performance, in-memory for energy efficiency), so could a hybrid approach integrating all three juice LLM performance even further?
In-memory sounds like the way to go, not just in terms of performance, but because it makes no sense to build an ASIC or program an FPGA for a model that will most likely be obsolete in a few months, if you're lucky.
There was a paper about an LLM running on the same power as a light bulb.
I'm unfamiliar; in this context is "in-memory" specialized hardware that combines CPU+RAM?
Is there a "nice" way to read content on Arxiv?
Every time I land on that site I'm so confused / lost in it's interface (or lack there of) I usually end up leaving without getting to the content.
Curious if anyone is making AccelTran ASICs?
The values (the FPGA numbers in particular) should also have been normalized by price.
This explains the success of Groq's ASIC-powered LPUs. LLM inference on Groq Cloud is blazingly fast. Also, the reduction in energy consumption is nice.
This paper is light on background, so I'll offer some additional context:
As early as the 90s it was observed that CPU speed (FLOPS) was improving faster than memory bandwidth. In 1995, William Wulf and Sally McKee predicted this divergence would lead to a "memory wall", where most computations would be bottlenecked by data access rather than by arithmetic operations.
Over the past 20 years, peak server hardware FLOPS has been scaling at 3x every 2 years, outpacing the growth of DRAM and interconnect bandwidth, which have only scaled at 1.6x and 1.4x every 2 years, respectively.
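To put that compounding in perspective, a quick back-of-the-envelope using those 2-year rates over 20 years:

    # Back-of-the-envelope: compound the 2-year scaling rates over 20 years.
    periods = 20 / 2
    flops_growth = 3.0 ** periods        # ~59,000x peak FLOPS
    dram_growth  = 1.6 ** periods        # ~110x DRAM bandwidth
    ic_growth    = 1.4 ** periods        # ~29x interconnect bandwidth
    print(flops_growth / dram_growth)    # compute outgrows DRAM bandwidth by ~540x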
Thus, for training and inference of LLMs, the performance bottleneck is increasingly shifting toward memory bandwidth. For autoregressive Transformer decoder models in particular, it can be the dominant bottleneck.
This is driving the need for new tech like compute-in-memory (CIM), also known as processing-in-memory (PIM): hardware in which operations are performed directly on the data in memory, rather than transferring it to CPU registers first, thereby improving latency and power consumption, and possibly sidestepping the great "memory wall".
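A rough roofline-style check shows why batch-1 decode ends up memory-bound; the model size and hardware numbers below are illustrative placeholders, not figures from the paper:

    # Rough roofline check for batch-1 autoregressive decode.
    # Illustrative numbers, not from the paper: a 7B-parameter model in fp16
    # on an accelerator with 1 PFLOP/s peak compute and 3 TB/s of bandwidth.
    params          = 7e9
    bytes_per_token = params * 2         # fp16 weights streamed once per generated token
    flops_per_token = params * 2         # ~1 multiply-add per weight
    intensity = flops_per_token / bytes_per_token    # ~1 FLOP/byte
    ridge     = 1e15 / 3e12              # ~333 FLOP/byte needed to be compute-bound
    tokens_per_s = 3e12 / bytes_per_token            # ~214 tok/s, set purely by bandwidth
    print(intensity, ridge, tokens_per_s)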
Notably, to compare ASIC and FPGA hardware across varying semiconductor process nodes, the paper uses a fitted polynomial to extrapolate everything to a common baseline of 16 nm:
> Based on the article by Aaron Stillmaker and B. Baas titled "Scaling equations for the accurate prediction of CMOS device performance from 180 nm to 7nm," we extrapolated the performance and the energy efficiency on a 16nm technology to make a fair comparison
But no such extrapolation is done for CIM/PIM, because they claim:
> As the in-memory accelerators the performance is not based only on the process technology, the extrapolation is performed only on the FPGA and ASIC accelerators where the process technology affects significantly the performance of the systems.
Which strikes me as an odd claim at face value, but perhaps others here could offer further insight on that decision.
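For a sense of what that normalization does mechanically, here's a sketch; the `delay_scale` / `energy_scale` arguments are placeholders standing in for the fitted polynomials from the Stillmaker & Baas paper, which I haven't reproduced here:

    # Sketch of node normalization; delay_scale / energy_scale stand in for the
    # fitted polynomials from Stillmaker & Baas (not reproduced here).
    def normalize_to_16nm(throughput, energy_per_op, node_nm, delay_scale, energy_scale):
        speedup = delay_scale(node_nm) / delay_scale(16)   # >1 for nodes older than 16nm
        saving  = energy_scale(node_nm) / energy_scale(16)
        return throughput * speedup, energy_per_op / saving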
Links below for further reading.
https://arxiv.org/abs/2403.14123
https://en.m.wikipedia.org/wiki/In-memory_processing
http://vcl.ece.ucdavis.edu/pubs/2017.02.VLSIintegration.Tech...