AI moves so fast that Vibe Coding still has a negative stigma attached to it, but even after 25 years of development, I'm not able to match the productivity of getting AI to implement the features I want. It's like sending multiple devs off to do the work for you: you tell them what you want and give iterative feedback until they've implemented all the features you want, the way you want them, and fixed every issue you find along the way - and they can write the tests and all the automation and deployment scripts too.
This is clearly the future of Software Development, but the models are so good at the moment that the future is possible now. I'm still getting used to it and having to rethink my entire dev workflow for maximum productivity, and whilst I wouldn't unleash AI Agents on a decade-old code base, all my new Web Apps will likely end up being AI-first unless there's a very good reason why it wouldn't provide a net benefit.
I used their $50 plan with the previously offered Qwen3 Coder 480B. While fast, none of the “supported” tools I tried were able to use it without hitting the per-minute request limit within a few seconds. It was incredibly frustrating. For the record, I tried OpenCoder, VSCode, Qwen Coder CLI, octofriend, and a few others I don't remember.
Fast forward to now, when GLM 4.6 has replaced Qwen3 Coder in their subscription plan. My subscription was still active, so I wanted to give this setup another shot. This time, though, I decided to give Cline a try. I've got to say, I was very pleasantly surprised: it worked really well out of the box. I guess whatever Cline does behind the scenes is more conducive to Cerebras's API. I used Claude 4.5 + Thinking for “Plan” mode and Cerebras/GLM 4.6 for “Act”.
The combo feels solid - much better than GPT-5 Codex alone. I found Codex to be very high quality but godawfully slow for long interactive coding sessions. The worst part is that I can't see what it's “thinking”, so I can't stop it in its tracks when it's going in the wrong direction.
In essence, Cerebras + GLM 4.6 feels like Grok Code Fast 1 on steroids. Just couple it with a frontier thinking model for planning (Claude 4.5 / GPT-5 / Gemini 2.5 Pro).
One caveat: sometimes the Cerebras API starts choking “because of high demand”, which has nothing to do with hitting subscription limits. Just an FYI.
Note: For the record, I was coding on a semi-complex Rust application tuned for a low-latency mix of IO and CPU work. The application is multi-threaded and makes extensive use of locking primitives and explicit reference counting (Arc). All models were able to handle the code really well given those constraints.
Note 2: I am also evaluating Synthetic's (synthetic.new) open-source model inference subscription, and I like it a lot. There's a large number of models to choose from, including gpt-oss-120b, and the usage limits are very, very generous - to the point that I don't think I will ever hit them.
I have been an AI-coding skeptic for some time. I always acknowledged LLMs as useful for solving specific problems and making certain things possible that weren't possible before. But I've not been surprised to see AI fail to live up to the hype. And I never had a personally magical moment - an experience that shifted my perspective à la the peak-end rule.
I've been using GLM 4.6 on Cerebras for the last week or so, since they began the transition, and I've been blown away.
I'm not a vibe coder; when I use AI coding tools, they're in the hot path. They save me time when I'm whipping up a bash script and can't remember the exact syntax, or for finding easily falsifiable answers that would otherwise take me a few minutes of reading. But even though GLM 4.6 is not as smart as Sonnet 4.5, it is smart enough. And because it is so fast on Cerebras, I genuinely feel that it augments my own ability and productivity; the raw speed has considerably shifted the tipping point of time-savings for me.
YMMV, of course. I'm very precise with the instructions I provide. And I'm constantly interleaving my own design choices into the process - I usually have a very clear idea in my mind of what the end result should look like - so, in the end, the code ends up how I would have written it without AI. But building happens much faster.
No affiliation with Cerebras, just a happy customer. I just upgraded to the $200/mo plan - and I'll admit I was one who scoffed when folks jumped on the original $200/mo Claude plan. I think this particular way of working with LLMs just fits well with how I think and work.
I wanted to try GLM 4.6 through their API with Cline, before spending the $50. But I'm getting hit with API limits. And now I'm noticing a red banner "GLM4.6 Temporarily Sold Out. Check back soon." at cloud.cerebras.ai. HN hug of death, or was this there before?
Was able to sign up for the Max plan & start using it via opencode. It does a way better job than Qwen3 Coder, in my opinion. Still extremely fast, but in less than an hour I used 7M input tokens, so with a single agent running I could easily blow past the 120M daily token limit. The speed difference compared with Claude Code is significant, though - to the point where most of the time I'm not waiting for generation, I'm waiting for my tests to run.
For reference, each new request has to resend all previous messages, and tool calls force new requests too. So input usage is essentially cumulative when you're chatting with an agent: my opencode agent's context window is only 50% used at 72k tokens, but Cerebras's online tracking shows that I've already used 1M input tokens and 10k output tokens.
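To see why that adds up, here's a back-of-the-envelope sketch (the per-turn numbers are hypothetical, chosen only so the final context lands near 72k tokens):

```python
# Hypothetical illustration: when an agent resends the whole conversation on
# every request, billed input tokens grow roughly quadratically with turns.
def cumulative_input_tokens(turns: int, start: int = 2_000, growth: int = 2_000) -> int:
    total = 0
    context = start
    for _ in range(turns):
        total += context   # each request re-sends the entire history so far
        context += growth  # tool results and replies keep accreting
    return total

# 36 turns ending at a ~72k-token context already bill ~1.33M input tokens
print(cumulative_input_tokens(36))  # -> 1332000
```

So a context window that is only half full can still translate into millions of billed input tokens.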
Been using Cerebras for quite a while now, previously with their Qwen3 Coder and now GLM 4.6. Overall the new model feels better at tool calls and at code in general: fewer tool-call failures with RooCode (which should also apply to Cline and others), but obviously still not perfect.
Currently on the 50 USD tier, which is very much worth the money. I'm kind of considering going for the 200 USD tier, BUT GPT-5, Sonnet 4.5, and Gemini 2.5 Pro still feel needed occasionally, so it'd be stupid to go for the 200 USD tier, not use it fully, and still have to pay up to around 100 USD per month for tokens in the other models. Maybe that will change in the future. When dealing with lots of changes (e.g. needing to make a showcase of 90 components, but with enough differences between them to make codegen unviable), Cerebras is already invaluable.
Plus the performance actually makes iterating faster, to the degree that I believe other models should also eventually run this fast. Oddly enough, the other day their 200 USD plan showed as "Sold out"; maybe they're scaling up capacity gradually. I really hope they never axe the Code plans - they literally have no competition for this mode of use. Maybe they'll also have a 100 USD plan some day, one can hope, but maybe offering just the 200 plan is better for them from an upsell perspective.
Oh, also: when I spill over my daily 24M-token limit, please let me use the Pay2Go thing on top of that. If some day I need 40M tokens instead of 24M, I'd pay for the additional ones.
I have created a "semi-interactive" AI coding workflow.
I write what I want, the LLM responds with edits, and my ~100 lines of Python apply them to the project. It can edit any number of files in one LLM call, which is very nice (and very cheap and fast). A sketch of the idea is below.
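Roughly, the idea looks like this - a minimal sketch, assuming a simple (hypothetical) edit-block format rather than the actual script:

```python
# Minimal sketch of a "semi-interactive" edit applier. Assumes the LLM is
# prompted to reply with blocks of the (hypothetical) form:
#
#   <<<FILE path/to/file.py
#   ...complete new file contents...
#   >>>END
import pathlib
import re

EDIT_BLOCK = re.compile(r"<<<FILE (\S+)\n(.*?)\n>>>END", re.DOTALL)

def apply_edits(llm_response: str, project_root: str = ".") -> list[str]:
    """Write every edit block found in the response; return the touched paths."""
    touched = []
    for path, contents in EDIT_BLOCK.findall(llm_response):
        target = pathlib.Path(project_root) / path
        target.parent.mkdir(parents=True, exist_ok=True)  # create missing dirs
        target.write_text(contents + "\n")
        touched.append(str(target))
    return touched
```

Since the model can emit any number of blocks in one completion, a single call can touch many files at once.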
I tested this with Kimi K2 on Groq (also ~1000 tok/s) and was very impressed.
I want to say this is the best use case for fast models - that frictionlessness in your workflow - though it burns tokens pretty fast working like that! (Agentic use is even more nuts in how fast it burns tokens on fast models, so the $50 is actually pretty great value.)
I don't know about GLM 4.6; some have said they're bench-maxing, so I kind of lost interest in trying them out. Does it live up to its reputation? Is it as good as Sonnet 4.5?
If they don't quantize the model, how do they achieve these speeds? Groq also says they don't quantize models (and I want to believe them), but we literally have no way to prove they're telling the truth.
This is important because their $50 premium (as opposed to $20 for Claude Pro or ChatGPT Plus) should be justified by the speed. GLM 4.6 is fine, but I don't think it's at the GPT-5/Claude Sonnet 4.5 level yet, so if I'm paying $50 for it on Cerebras, it should be mainly because of the speed.
What kind of workflow justifies this? I'm genuinely curious.
This is more evidence that Cognition's SWE-1.5 is a GLM-4.6 fine-tune.
I've been a customer of the $200 Max plan for 2 months. I fell in love with the Qwen3 Coder 480B model (Q3C), which was fast - twice the speed of GLM. GLM 4.6 is just meh. I mean, it's way faster than competitors and practically at Sonnet 4.x level in coding and tool use, but not a life-changing difference.
Yes, Qwen3 made more mistakes than GLM - around 15% more in my quick throwaway evals - but it was a more professional model overall: more polished in some aspects, better with international languages, and, being non-reasoning, ideal for a lot of API tasks that could be run near-instantaneously. I think the Qwen line is a more consistent offering, with other versions of the model at 32B and VL, now an 80B one, etc. I guess the problem was that Qwen Max was closed source, signalling that Qwen may not have given Cerebras a way forward to evolve. GLM 4.6 covers precisely that hole. Not that Cerebras is a model provider of any kind; their service levels are "buggy" (right now it's been down for an hour and probably won't be fixed until California wakes up at 9am PST). So it does feel like we are not the customers but the product - a marketing stunt for them to get visibility for their tech.
GLM feels like they (Z.ai) are just distilling whatever they can get into it. GLM sometimes switches to Chinese, or just cuts off. It does have a bit more "intelligence" than Q3C, but not enough to say it solves the toughest problems. Regardless, for tough nuts to crack I use my Codex Plus plan.
Ex: in one of my evals, it took 15 turns to solve an issue using Cerebras Q3C. It took 12 turns with GLM, but GLM takes about 2x the time per turn, so instead of doing a full task from zero to commit in, say, 15 minutes, it takes 24 minutes.
In another eval (Next.js CSS editing), my task with Q3C was done in 1:30. GLM 4.6 took 2:24. The same task in Codex took 5:37, with maybe 1 or 2 turns. The Codex DX is one of working unattended: prompt it and go do something else, and there's a good chance it will get it right after 0, 1, or 2 nudges. With CC + Cerebras it's a completely different DX: given the speed, it feels just like programming, but super-fast. Prompt, read the change, accept (or don't), accept, accept, accept, test it out, accept, prompt, accept, interrupt, prompt, accept, and 1:30 later we're done.
Like I said, I use Claude Code + a proxy (llmux). The coding agent makes a HUGE difference, and CC is hands-down the best agent out there.
At what quantization? And if it is in fact quantized below FP8, how is performance impacted across the various benchmarks?
1000 tokens/s is pretty fancy. I just wonder how sustainable the pricing is or if they are VC-fueled drug dealers trying to convert us into AI-coholics...
It is definitely fun playing with these models at these speeds. The question is just how far from real pricing is 500M tokens for $50?
Either way, LLM usage will keep growing for some time to come, and energy usage will grow with it. Good times for renewables, and probably for fusion and fission.
Selling shovels in a gold rush was always a reliable business. Cerebras was only valued at $8.1B as of a month ago; compared to Nvidia, that seems like pocket change.
I'm curious to know what the cost is to switch contexts. Pure performance is amazing, but how long does it take to get a system going - to load the model and build context? Which systems can switch contexts while keeping the model state intact, and when is switching destructive?
I have a lot of questions about how models are run at scale, so I'm curious to know more. With a chip as massive as Cerebras's wafer-scale part, it feels like switching might be even more expensive. Or maybe there's some brilliant strategy to keep multiple contexts loaded and flip between them! Inventorying and using so much RAM, spread out that widely, is its own challenge!
So basically "change this in the UI" and you see it happen almost real time.
I find the fast models good for rapidly iterating UI changes with voice chat - like "add some padding above the text box" or "right-align the button". But I find the fast models useless for deep coding work. Still, a fast model has its place - just not at $50/month. Cursor has Composer 1 and Grok Code Fast for free; I'm not sure what $50/month gets me that those don't. I liked the stealth Supernova model a lot too.
I want to use Cerebras, but it's just not production-ready. I will root for them from the sidelines for now.
The $50/month Cerebras Code plan - first with Qwen3 Coder 480B, now with GLM - is my secret weapon.
Stalin used to say that in war, "quantity has a quality all its own". And I think that for coding agents, speed is a quality all its own too.
Maybe not for blind vibe coding, but if you are a developer who can understand the code the agent generates and change it, the fast feedback of fast inference is a game changer. I don't care if Claude is better than GLM 4.6; fast iterations are king for me now.
It's like moving from DSL to gigabit fiber (FTTH).
It would be nice if that page provided more information. I assume this is just the output token generation speed. Is it using speculative decoding to get to 1000 tokens/sec? Is lossy quantization being used to speed things up? I tend to rank raw tokens per second relatively low on the list of things I care about, when model/inference quality and the harness play a much bigger role in how I feel about using a coding agent.
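(For context: speculative decoding has a cheap draft model propose a few tokens that the big target model then verifies, keeping the agreeing prefix. Below is a toy greedy sketch with hypothetical stand-in callables; real systems verify all k draft tokens in a single batched forward pass of the target model, which is where the speedup comes from.)

```python
# Toy sketch of greedy speculative decoding. `target_next` and `draft_next`
# are hypothetical stand-ins: callables taking a token list and returning
# that model's next greedy token.
def speculative_decode(target_next, draft_next, prompt, k=4, max_new=32):
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1. The small draft model cheaply proposes k tokens autoregressively.
        ctx, proposal = list(out), []
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2. The target model checks the proposals: keep the agreeing prefix;
        #    on the first disagreement, emit the target's own token and stop.
        for t in proposal:
            expected = target_next(out)
            out.append(expected)
            if expected != t:
                break
    return out
```

Under greedy decoding this is lossless - the output is exactly what the target model would have produced alone, just faster - whereas quantization is lossy, which is why the two questions are worth separating.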
It’s just amazing to have a reliable model at the speed of light. Was waiting for such a great model for a long time!
Vibe Slopping at 1000 tokens per second
Unfortunately for me, the models on Cerebras weren't as good as Claude Code. Speedy, but I needed to iterate more. Codex is trustworthy and slow; Claude is better at iterating. But none of the Cerebras models at the $50 tier were worth anything for me. They would have been something if they'd come out first, but we have these alternatives now.
I have been using Z.ai's (the creators of GLM) "Coding Plan" with GLM-4.6: $3/month, with 3x higher limits than Claude Pro, they say.
(I have both and haven't run into any limits yet, so I'm probably not a very heavy user.)
I'm quite impressed with the model. I have been using GLM-4.6 in Claude Code instead of Sonnet, and finding it fine for my use cases. (Simple scripting and web stuff.)
(Note: Z.ai's GLM doesn't seem to support web search (it fails) or image recognition (it hallucinates). To fix that, I use Claude Code Router and hooked those two features up to Gemini (free) instead.)
I find that Sonnet produces much nicer code, and I often find myself asking Sonnet to clean up GLM's code. More recently I got the Pro plan for Claude, so I'm mostly just using Sonnet directly now. (I haven't hit the rate limits yet, but we'll see!)
So in my experience if you're not too fussy you can currently get "80% of Claude Code" for like $3/month, which is pretty nuts.
GLM also works well in Charm's Crush, though it seems to be better optimized for Claude Code (I think they might have fine-tuned it for that).
---
I have tested Kimi K2 at 1000 tok/s via OpenRouter, and it's bloody amazing, so I imagine this supercharged GLM will be great too. Alas, $50!