Show HN: Run Qwen3-Next-80B on 8GB GPU at 1tok/2s throughput

anuarsh | 118 points

How dramatically does this shorten the lifespan of SSDs?

ydlr | 8 hours ago

Nice. Seems like I can't run this on my Apple silicon M chips, right?

cahaya | a day ago

Great work! Can this technique also be used to run image diffusion models on lower-VRAM GPUs?

addandsubtract | a day ago

What is the throughput for gpt-oss? 1 token every 2 seconds is really slow, but understandable because you are moving the cache to disk.

mendeza | a day ago

There's one more exciting thing about Qwen3-Next (besides the efficient MoE architecture and fast linear attention): MTP (multi-token prediction). It's an additional layer that lets you generate more tokens without going through the whole model again. I'm trying to make it work, but unsuccessfully so far. Maybe someone could help me with it - https://github.com/Mega4alik/ollm/blob/dev/src/ollm/qwen3_ne... (dev branch). Take a look.
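
Roughly, the idea: the main model's last hidden state plus the embedding of the token it just produced feed a small extra head that drafts the token after that. Here's a minimal sketch - the names and the single fused linear layer are illustrative assumptions, not the actual Qwen3-Next/oLLM code (the real MTP module is a full extra layer):

    # Toy MTP head: drafts token t+2 from the hidden state at step t
    # and the embedding of token t+1, skipping a full forward pass.
    # All names here are hypothetical, not the real Qwen3-Next API.
    import torch
    import torch.nn as nn

    class MTPHead(nn.Module):
        def __init__(self, hidden: int, vocab: int):
            super().__init__()
            self.proj = nn.Linear(2 * hidden, hidden)  # fuse state + embedding
            self.lm_head = nn.Linear(hidden, vocab)    # logits for the drafted token

        def forward(self, h_last: torch.Tensor, next_tok_emb: torch.Tensor):
            fused = torch.tanh(self.proj(torch.cat([h_last, next_tok_emb], dim=-1)))
            return self.lm_head(fused)

    head = MTPHead(hidden=64, vocab=1000)
    h = torch.randn(2, 64)     # last hidden state from the main model
    e = torch.randn(2, 64)     # embedding of the token just sampled
    draft_logits = head(h, e)  # cheap draft of the *next* token

In inference the drafted token is typically verified on the next full pass (speculative-decoding style), so the speedup comes from the full passes you skip when drafts are accepted.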

anuarsh | a day ago

Why even bother with the GPU at that point? CPU would be just as fast if you're bottlenecked on SSD bandwidth.
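
For a rough sense of why - every number here is an assumption (~3B activated params for Qwen3-Next-80B-A3B, bf16 weights, ~3 GB/s sustained SSD reads):

    # Back-of-envelope: time to stream one token's worth of active
    # weights off the SSD. All figures are assumptions.
    active_params = 3e9        # Qwen3-Next-80B activates ~3B params/token
    bytes_per_param = 2        # bf16
    ssd_read_bps = 3e9         # ~3 GB/s sustained sequential read
    print(active_params * bytes_per_param / ssd_read_bps)  # -> 2.0 seconds/token

That lands right on the advertised 1 token per 2 seconds, which supports the point: the disk, not the compute device, sets the pace.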

aappleby | a day ago