SplitQuantV2: Enhancing Low-Bit Quantization of LLMs Without GPUs

PaulHoule | 34 points

Extremely low-bit quantization makes me curious why it is so effective.

Why is it better to run a bigger model with more parameters at lower numerical precision?

Obviously more parameters are better, but why exactly? For that you need to understand that a transformer layer consists of the self-attention mechanism followed by a bog-standard feedforward network (an MLP, usually a couple of linear layers with a wide hidden expansion). Most of the parameters are there.
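Some rough back-of-the-envelope arithmetic to make that concrete (the hidden width and the standard 4x MLP expansion here are my own assumptions, not anything from the paper):

    # Rough parameter count for one transformer layer (hypothetical hidden
    # width d, standard 4x MLP expansion): the feedforward block dominates.
    d = 4096                        # hidden/embedding width (assumed)
    attn_params = 4 * d * d         # Q, K, V and output projections
    mlp_params = 2 * 4 * d * d      # up-projection (d -> 4d) plus down-projection (4d -> d)
    print(attn_params, mlp_params)  # 67,108,864 vs 134,217,728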

My personal theory is based on the fact that ReLU is the simplest possible activation function that works, yet all it does is clamp negative values to zero. How could a network use that for learning?

The answer to the question is quite simple. If you have negative weights w_i, form the sum \sum_i w_i x_i, add a positive bias b, and throw that into ReLU, you get a boolean-like function: it stays on (positive) until the weighted sum drops below -b, at which point it switches off to zero. In other words, you can build a comparison operator out of ReLU. Take it a few steps further and you can probably implement any arbitrary boolean function directly in each row of your MLP.
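A minimal sketch of that idea (my own toy numbers, not anything from the paper): with two negative weights and a positive bias, a single ReLU unit behaves like a NAND gate over 0/1 inputs, and NAND alone is enough to build any boolean function.

    import numpy as np

    def relu(z):
        return np.maximum(z, 0.0)

    w = np.array([-1.0, -1.0])   # negative weights (made-up values)
    b = 1.5                      # positive bias

    for x in [np.array([0.0, 0.0]),   # neither input active -> output stays on
              np.array([1.0, 0.0]),   # one input active     -> still on
              np.array([1.0, 1.0])]:  # both active: sum drops below -b -> off
        print(x, relu(w @ x + b))     # behaves like NOT(x1 AND x2), i.e. NAND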

This means that most of the precision is only really needed during training, because you want a nicely continuously differentiable function for gradient descent, but the model itself is mostly operating on a form of fuzzy boolean logic.
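As a toy illustration of why precision matters less at inference (again my own sketch, plain symmetric round-to-nearest, not the SplitQuantV2 method): round the same gate's weights down to 3 bits and the on/off pattern survives.

    import numpy as np

    def relu(z):
        return np.maximum(z, 0.0)

    def quantize(w, bits=3):
        # symmetric round-to-nearest to integers in [-(2^(bits-1)-1), 2^(bits-1)-1]
        qmax = 2 ** (bits - 1) - 1
        scale = np.abs(w).max() / qmax
        return np.round(w / scale) * scale   # dequantized weights

    w = np.array([-0.9, -1.1])
    b = 1.5
    w_q = quantize(w, bits=3)                # roughly [-0.73, -1.1]

    for x in [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([1.0, 1.0])]:
        print(x, relu(w @ x + b), relu(w_q @ x + b))   # same on/off pattern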

This means that the embedding length, basically the width of each token's vector, plays a key role in the ability to encode these mostly binary concepts.

Bigger models have wider token embeddings. That's why bigger models with low-bit quantization outperform smaller models with high-bit quantization.
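Rough arithmetic on the memory side (the model sizes here are hypothetical, just to illustrate the trade-off): at a fixed memory budget, fewer bits per weight buy you more parameters.

    def model_gb(n_params_billion, bits_per_weight):
        # weight memory only, ignoring activations and KV cache
        return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

    print(model_gb(70, 4))    # ~35 GB: a 70B-parameter model at 4-bit
    print(model_gb(13, 16))   # ~26 GB: a 13B-parameter model at fp16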

imtringued | 25 days ago

I feel like for many tasks there's a certain "good enough" threshold where local small LMs do just as well, but privately, so no cloud LLM is needed. I think the future is mostly on-device SLMs and their agentic coordination.

In that sense, a local agentic framework (js/ts based) would soon be very relevant.

mentalgear | 25 days ago