Cerebras Code
> running at speeds of up to 2,000 tokens per second, with a 131k-token context window, no proprietary IDE lock-in, and no weekly limits!
I was excited, then I read this:
> Send up to 1,000 messages per day—enough for 3–4 hours of uninterrupted vibe coding.
I don't mind paying for services I use. But it's hard to take this seriously when the claim in the first paragraph contradicts the fine print.
If you would like to try this in a coding agent (we find the qwen3-coder model works really well in agents!), we have been experimenting with Cerebras Code in Sketch. We just pushed support, so you can run it with the latest version, 0.0.33:
brew install boldsoftware/tap/sketch
CEREBRAS_API_KEY=...
sketch --model=qwen3-coder-cerebras -skaband-addr=
In our experience it seems overloaded right now, to the point where we get better results with our usual hosted version: sketch --model=qwen
Some users who signed up for Pro ($50/month) are reporting limits beyond those advertised.
> While they advertise a 1,000-request limit, the actual daily constraint is a 7.5 million-token limit. [1]
That assumes an average of 7.5k tokens per request, whereas their marketing videos show API requests ballooning by ~24k tokens per request. Still lower than the API price.
[1] https://old.reddit.com/r/LocalLLaMA/comments/1mfeazc/cerebra...
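To make the arithmetic concrete, here is a rough back-of-the-envelope sketch using only the figures quoted above. Treating the ~24k figure as a flat per-request size is my assumption, and the function names are just for illustration.

```python
# Back-of-the-envelope check of the figures quoted in this comment.
DAILY_TOKEN_LIMIT = 7_500_000          # daily token cap reported in the linked thread
ADVERTISED_REQUESTS = 1_000            # advertised messages per day
OBSERVED_TOKENS_PER_REQUEST = 24_000   # ~24k per request (assumed flat size)


def tokens_per_request(daily_limit: int, requests: int) -> float:
    """Average tokens each request may use if all requests are to fit in the cap."""
    return daily_limit / requests


def requests_per_day(daily_limit: int, tokens_each: int) -> int:
    """Whole requests that fit under the cap at a given size per request."""
    return daily_limit // tokens_each


print(tokens_per_request(DAILY_TOKEN_LIMIT, ADVERTISED_REQUESTS))        # 7500.0
print(requests_per_day(DAILY_TOKEN_LIMIT, OBSERVED_TOKENS_PER_REQUEST))  # 312
```

So at roughly 24k tokens per request, the token cap would be reached after about 300 requests rather than 1,000.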
2k tokens/second is insane. While I'm very much against vibe coding, such performance essentially means you can get near GitHub Copilot-level speed with drastically better quality.
For in-editor use that's game changing.
Windsurf also has Cerebras/Qwen3-Coder: 1,000 user messages per month for $15.
How does context buildup generally work for code-generating machines? Do the programs just use human notes plus the current code directly? Are there specific ranking steps that need to be done?
I was waiting for more subscription-based services to pop up to compete with the inference providers on a commodity level.
I think a lot more companies will follow suit and the competition will make pricing much better for the end user.
congrats on the launch Cerebras team!
Does it work with claude-code-router? I was getting API errors this week trying to use Qwen3 on Cerebras through OpenRouter with claude-code-router.
Their hardware is incredible. Why aren’t more investors lining up for this in this environment?
FYI, you are probably going to use up your tokens because there's a total limit of tokens per day, so in about 300 requests it's feasible to use it all up. See https://www.reddit.com/r/LocalLLaMA/comments/1mfeazc/cerebra...
I'm so excited to see a real competitor to Claude Code! Gemini CLI, while decent, does not have a $200/month pricing model and charges per API access. Codex is the same. I'm trying to get into https://cloud.cerebras.ai/ to try the $50/month plan but I can't even get in.
At $200/month the comparable should be Opus 4 not Sonnet 4.
Attn: Cerebras
Any attempt to deal with "<think>" in the code gets it replaced with "<tool_call>".
Both in inference.cerebras.ai chat and API.
Same model on chat.qwen.ai doesn't do it.
My understanding is that the coding agents people use can be modified to plug into any LLM provider's API?
The difference here seems to be that Cerebras does not appear to have Qwen3-Coder through their API! So now there is a crazy fast (and apparently good too?) model that they only provide if you pay the crazy monthly sub?
I'm finding myself switching between subscriptions to ChatGPT, T3 Chat, DeepSeek, Claude Code etc. Their subscription models aren't compatible with making it easy to take your data with you. I wish I could try this out and import all my data.
I've been waiting on this for a LONG time. Integration with Cursor when Cerebras released their earlier models was patchy at best, even through OpenRouter. It's nice to finally see official support, although I'm a bit worried that, long term, the time spent on bash MCP calls will end up dominating.
Still, definitely the right direction!
EDIT: it doesn't seem like anything but a first-party API with a monthly plan.
Super curious to see some comparisons to claude code. Especially Opus, since they're primarily comparing it to Sonnet in that graph.
I use regular Cerebras for the plan stage in Cline, so I’m very excited to try this out.
Is this available as cline/roo-code integration? I think it might be on openrouter too.
For those that have tried this, what kind of time-to-first-token latency are you seeing?
Groq also probably has this in the works. Fun times.
What are the token prices?
They should just host all the latest open source models FTW.
It says it works with your favorite IDE. How do you (the reader) plan to use this? I use Cursor, but I'm not sure if this replaces my need to pay for Cursor, or if I need to pay for Cursor AND this, and add in the LLM?
Or is VS Code pretty good at this point? Or is there something better? These are the only two ways I'd know how to actually consume this with any success.
This has to be a monstrous money loser.
If they can maintain this pricing level, and if Qwen3‑Coder is as good as people say, then they will have an enormous hit on their hands. A massive money-losing hit, but a hit.
Very interesting!
PS: Did they reduce the context window? It looks like it.
How is this even possible?
Tried this out with Cline using my own API key (Cerebras is also available as a provider for Qwen3 Coder via OpenRouter here: https://openrouter.ai/qwen/qwen3-coder) and realized that without caching, this becomes very expensive very quickly. Specifically, after each new tool call, you're sending the entire previous message history as input tokens, which are priced at $2/1M via the API just like output tokens.
The quality is also not quite what Claude Code gave me, but the speed is definitely way faster. If Cerebras supported caching & reduced token pricing for using the cache I think I would run this more, but right now it's too expensive per agent run.
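To illustrate why resending the history adds up, here is a minimal sketch of the input-token bill for one agent run without caching. Only the $2/1M price comes from the comment above; the starting context size, the tokens added per tool call, and the number of calls are made-up numbers for illustration, not measurements.

```python
PRICE_PER_MILLION_INPUT = 2.00  # USD per 1M input tokens, as quoted above


def uncached_input_tokens(base_context: int, tokens_per_step: int, steps: int) -> int:
    """Total input tokens billed when every call resends the full history."""
    total = 0
    history = base_context
    for _ in range(steps):
        total += history             # the whole history is sent as input
        history += tokens_per_step   # the tool call and its result are appended
    return total


def input_cost_usd(tokens: int, price_per_million: float = PRICE_PER_MILLION_INPUT) -> float:
    """Input-token cost in USD at a flat per-million price."""
    return tokens / 1_000_000 * price_per_million


# Hypothetical run: 10k starting context, 2k tokens added per tool call, 50 calls.
tokens = uncached_input_tokens(base_context=10_000, tokens_per_step=2_000, steps=50)
print(tokens)                             # 2950000
print(round(input_cost_usd(tokens), 2))   # 5.9
```

The total grows roughly with the square of the number of tool calls, which is why a discounted cache price would matter so much for long agent runs.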