Recent Developments in LLM Architectures: KV Sharing, mHC, and Compressed Attention
From Gemma 4 to DeepSeek V4, How New Open-Weight LLMs Are Reducing Long-Context Costs

After a short family break, I am excited to be back and catching up on a busy few weeks of open-weight LLM releases. What stood out to me is how strongly the newer architectures focus on long-context efficiency. As reasoning models and agent workflows keep more tokens around (and for longer), KV-cache size, memory traffic, and attention cost quickly become the main constraints, and LLM developers are adding a growing number of architecture tricks to reduce those costs.

The main examples I want to look at are KV sharing and per-layer embeddings in Gemma 4, layer-wise attention budgeting in Laguna XS.2, compressed convolutional attention in ZAYA1-8B, and mHC plus compressed attention in DeepSeek V4. Most of these changes look like small tweaks in my architecture diagrams, but some of them are quite intricate design changes that are worth a more detailed discussion.

Note that this article is about architecture designs, so I will mostly skip dataset mixtures, training schedules, post-training details, RL recipes, benchmark tables, and product comparisons. Even with that narrower scope, there is a lot to cover. And, as always, the article turned out longer than I expected, so I will keep the focus on what changes inside the transformer block, residual stream, KV cache, or attention computation.

Please also note that I am only covering design choices that are interesting (new) and that I haven't covered elsewhere yet. This list includes:

- KV sharing and per-layer embeddings in Gemma 4
- Compressed convolutional attention in ZAYA1
- Attention budgeting in Laguna XS.2
- mHC and compressed attention in DeepSeek V4

Previous Topics

Before getting into the new parts, here are the two previous articles I will refer back to. The first one gives broader architecture background on recent MoE models, routed experts, active parameters, and model-size comparisons. The second one covers the attention background that comes up repeatedly below, including MHA, MQA, GQA, MLA, sliding-window attention, sparse attention, and hybrid attention designs.

I also turned several of these explanations into short, standalone tutorial pages in the LLM Architecture Gallery. For example, readers can find compact explainers for GQA, MLA, sliding-window attention, DeepSeek Sparse Attention, MoE routing, and other concepts linked from the corresponding model cards and concept labels.

1. Reusing KV Tensors Across Layers to Shrink the Cache (Gemma 4)

For this tour of architecture advances and tweaks, we will go back to the beginning of April, when Google released their new open-weight Gemma 4 suite of models. They come in three broad categories:

- the Gemma 4 E2B and E4B models for mobile and small, local (embedded) devices (aka IoT),
- the Gemma 4 26B mixture-of-experts (MoE) model, optimized for efficient local inference, and
- the Gemma 4 31B dense model, for maximum quality and more convenient post-training (since MoEs are trickier to work with).

The first small architecture tweak in the E2B and…
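Before going further, here is a minimal sketch of the general idea behind cross-layer KV sharing: only some "producer" layers compute and cache their own K/V tensors, and the layers in between reuse them, so the KV cache only grows with the number of producer layers. This is an illustrative toy implementation under my own assumptions (the class names, the `group_size` grouping scheme, and the simplified residual block are hypothetical), not Gemma 4's exact mechanism, which is discussed in more detail below.

```python
# Illustrative sketch of cross-layer KV sharing (not Gemma 4's exact scheme).
# The first layer in each group of `group_size` layers computes K/V; the
# remaining layers reuse those tensors, shrinking the per-token KV cache.

import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedKVAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, computes_kv: bool):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.computes_kv = computes_kv  # only "producer" layers own K/V projections
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        if computes_kv:
            self.k_proj = nn.Linear(d_model, d_model, bias=False)
            self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, shared_kv=None):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        if self.computes_kv:
            k = self.k_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
            v = self.v_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
            shared_kv = (k, v)  # this is what would be stored in the KV cache
        else:
            k, v = shared_kv  # reuse the K/V tensors from the producer layer
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(b, t, -1)
        return self.out_proj(out), shared_kv


class SharedKVStack(nn.Module):
    def __init__(self, n_layers=6, d_model=64, n_heads=4, group_size=3):
        super().__init__()
        # every group_size-th layer computes fresh K/V; the others reuse it
        self.layers = nn.ModuleList(
            SharedKVAttention(d_model, n_heads, computes_kv=(i % group_size == 0))
            for i in range(n_layers)
        )

    def forward(self, x):
        shared_kv = None
        for layer in self.layers:
            out, shared_kv = layer(x, shared_kv)
            x = x + out  # simplified residual stream; no MLP or norms for brevity
        return x


if __name__ == "__main__":
    model = SharedKVStack()
    x = torch.randn(2, 16, 64)
    print(model(x).shape)  # torch.Size([2, 16, 64])
```

With 6 layers and a group size of 3, only 2 of the 6 layers contribute K/V entries to the cache, which is where the memory savings come from; the trade-off is that the non-producer layers attend over keys and values computed from an earlier layer's hidden states.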