Nice. Seems like I can't run this on my Apple Silicon M-series chips, right?
Great work! Can this technique also be used to run image diffusion models on lower-VRAM GPUs?
What is the throughput for gpt-oss? One token every two seconds is really slow, though understandable given that you're moving the cache to disk.
There's one more exciting thing about Qwen3-Next (besides the efficient MoE architecture and fast linear attention): MTP (multi-token prediction). It's an additional layer that lets you generate extra tokens without another full pass through the model. I'm trying to make it work, but no success yet. Maybe someone could help me with it: https://github.com/Mega4alik/ollm/blob/dev/src/ollm/qwen3_ne... (dev branch). Take a look. Conceptually it's something like the sketch below (hypothetical module names and toy sizes, not the repo's actual code): the MTP head reuses the trunk's last hidden state plus the embedding of the token just sampled to score the following token, so drafting it is nearly free compared to a full forward pass.
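```python
import torch
import torch.nn as nn

class MTPHead(nn.Module):
    """Hypothetical multi-token-prediction head: fuses the trunk's last
    hidden state with the embedding of the token just sampled, then scores
    the *next* token directly, skipping a full pass through the model."""
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(2 * hidden_size, hidden_size)
        self.norm = nn.LayerNorm(hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, last_hidden: torch.Tensor, next_tok_emb: torch.Tensor):
        fused = self.proj(torch.cat([last_hidden, next_tok_emb], dim=-1))
        return self.lm_head(self.norm(fused))  # logits for token t+2

# Draft-then-verify, speculative-decoding style: the drafted token is only
# kept if the full model's next forward pass agrees with it. Toy sizes here.
hidden, vocab = 512, 1000
head = MTPHead(hidden, vocab)
embed = nn.Embedding(vocab, hidden)
last_hidden = torch.randn(1, hidden)   # stand-in for the trunk's output at step t
tok_t1 = torch.tensor([42])            # token sampled normally at step t+1
draft_t2 = head(last_hidden, embed(tok_t1)).argmax(dim=-1)  # "free" extra token
```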
Why even bother with the GPU at that point? CPU would be just as fast if you're bottlenecked on SSD bandwidth.
How dramatically does this shorten the lifespan of SSDs?