Cerebras Blog·Infra·5d ago·~3 min read

Cerebras Brings Trillion Parameter Inference to Enterprises with Kimi K2.6

Cerebras is now running Kimi K2.6, the leading trillion parameter open-weight model, in enterprise customer trials. Widely recognized as the leader in fast inference, Cerebras has set benchmarks across numerous open-weight models including GLM-4.7, GPT-OSS-120B, and Qwen 3, while delivering large speedups to customers such as OpenAI and Cognition on agentic coding models.

K2.6 is one of the most frequently requested models, and we are excited to bring it to customers. It is the first one trillion parameter open-weight model we have served, achieving performance approaching 1,000 tokens per second as measured by Artificial Analysis. The result: agentic coding shifts from wait-and-review loops to real-time development, boosting developer productivity.

The Fastest 1 Trillion Parameter Open Model

Artificial Analysis measured Cerebras running K2.6 at 981 output tokens per second, which is 6.7x faster than the next-fastest GPU-based cloud and 23x faster than the median inference provider. Fast inference shortens end-to-end response time, making agentic coding feel near instant. For a 10,000-token input request, inclusive of prompt processing, reasoning, and generating 500 output tokens, Cerebras delivered the full response in 5.6 seconds, compared to 163.7 seconds on the official Kimi endpoint. That is a 29x improvement in time to final answer.

“Cerebras has achieved 981 tokens per second on Kimi K2.6 — the fastest performance we have ever measured on a trillion parameter model. The result is consistent with Cerebras's track record of leading output speeds across the models served on Cerebras hardware,” said George Cameron, co-founder, Artificial Analysis.

Kimi K2.6 – Frontier Intelligence, Open Weights

K2.6 is widely regarded as the leading open-weight model for coding and agentic work. It tops SWE-Bench Pro at 58.6, outperforming Claude Opus 4.6 and matching GPT-5.4, while leading the field on agentic benchmarks like Humanity’s Last Exam and DeepSearchQA. Developers have adopted it as the open alternative to closed-source frontier models, particularly for coding, where its taste for clean front-end design has made it a favorite for full-stack application generation. The 2.6 release extends that capability from front-end into full-stack workflows, including authentication, database operations, and long-horizon agent execution.

Trillion Parameter Serving on Cerebras

The Cerebras Wafer-Scale Engine is built for scale. A cluster of CS-3 systems can be configured to support multi-trillion parameter models for both training and inference, and we have spent significant engineering effort optimizing the stack to serve large models efficiently.

Cerebras stores Kimi K2.6 in the model’s original 4-bit weights while performing computation at 16-bit floating point, for optimal accuracy. Weights are distributed across multiple wafers, with activations streamed between them. All-to-all communications between layers run entirely on the on-wafer network fabric, which has over 200x the bandwidth of NVLink on NVL72. Combined with our custom kernels and speculative decoding, we can serve trillion parameter MoE models at close to 1,000 tokens per second, the fastest result Artificial Analysis has measured on a trillion parameter model.

What This Unlocks: Agentic Coding at Speed

Agentic coding has become the highest-value use case for large language models, and it is the workload most sensitive…
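
The serving section above mentions keeping weights in 4-bit form while computing at 16-bit floating point. As a rough illustration of that general idea only, here is a minimal pure-Python sketch. Everything in it is an assumption made for illustration: the symmetric integer scheme, the per-group scale, the group size, and all function names are hypothetical, and none of it describes the model's actual weight format or Cerebras's kernels.

```python
import struct

GROUP_SIZE = 4  # hypothetical group size, kept tiny for readability


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value.

    Assumes |x| is within the half-precision range (struct raises otherwise).
    """
    return struct.unpack("<e", struct.pack("<e", x))[0]


def quantize_group(weights):
    """Symmetric 4-bit quantization: integers in [-8, 7] plus one fp16 scale."""
    max_abs = max(abs(w) for w in weights)
    scale = to_fp16(max_abs / 7)
    if scale == 0.0:  # all-zero (or underflowing) group: any scale works
        scale = 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def pack_int4(values):
    """Pack signed 4-bit integers two per byte, low nibble first."""
    vals = list(values)
    if len(vals) % 2:
        vals.append(0)  # pad to an even count
    out = bytearray()
    for lo, hi in zip(vals[0::2], vals[1::2]):
        out.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(out)


def unpack_int4(data, count):
    """Inverse of pack_int4: recover `count` signed 4-bit integers."""
    vals = []
    for byte in data:
        for nib in (byte & 0xF, byte >> 4):
            vals.append(nib - 16 if nib >= 8 else nib)
    return vals[:count]


def dequant_dot(packed, scale, activations):
    """Expand stored 4-bit weights to fp16 values, then take a dot product.

    The accumulator is an ordinary Python float; only the operands are
    rounded to half precision in this sketch.
    """
    q = unpack_int4(packed, len(activations))
    acc = 0.0
    for qi, a in zip(q, activations):
        acc += to_fp16(qi * scale) * to_fp16(a)
    return acc


if __name__ == "__main__":
    weights = [0.7, -0.35, 0.1, 0.0]  # one group of GROUP_SIZE weights
    q, scale = quantize_group(weights)
    packed = pack_int4(q)
    print("stored bytes:", len(packed), "scale:", scale)
    print("dot with ones:", dequant_dot(packed, scale, [1.0] * GROUP_SIZE))
```

The point of the sketch is the split between storage and arithmetic: the weights occupy half a byte each at rest, and are only widened to a 16-bit value at the moment they are multiplied with an activation.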

#inference #coding

read full article on Cerebras Blog →
