entity Wed Apr 08 2026 00:00:00 GMT+0000 (Coordinated Universal Time)

Kimi Linear

#language-model#mixture-of-experts#moonshot-ai#kimi#transformer

Kimi Linear

Kimi Linear is a large language model architecture developed by Moonshot AI (the Kimi Team). It is a Mixture-of-Experts (MoE) Transformer with 48B total parameters and 3B activated parameters, following the Moonlight / DeepSeek-V3 design paradigm. The attention-residuals paper uses Kimi Linear as the production-scale testbed for integrating Block Attention Residuals.

Architecture

Based on what is described in the Attention Residuals paper:

Model class: MoE Transformer following the Moonlight / DeepSeek-V3 design.
Total parameters: 48B.
Activated parameters: 3B per token.
Transformer blocks: 27 (54 layers, counting attention and MLP sublayers separately).
Expert routing: 8 out of 256 routed experts plus 1 shared expert per MoE layer.
Attention: Hybrid design interleaving Kimi Delta Attention (KDA) and Multi-Head Latent Attention (MLA) layers in a 3:1 ratio, each followed by an MoE feed-forward layer. MLA operates without positional encodings (NoPE), so context extension requires no modifications such as YaRN or attention temperature rescaling.

Training Recipe (as used in the AttnRes paper)

Pre-training data: 1.4T tokens total.
Context window: 4096 tokens during pre-training, extended to 32K during mid-training.
Optimizer: Muon.
Learning rate schedule: WSD (Warmup-Stable-Decay).
Global batch size: 8M tokens.
Two-stage training: (i) WSD pre-training on 1T tokens, followed by (ii) mid-training on ~400B high-quality tokens following the Moonlight annealing recipe.

Integration with Attention Residuals

When equipped with Block AttnRes:

Block configuration: 6 layers per block, producing 9 blocks plus the token embedding for 10 depth-wise sources.
Training overhead: Less than 4% under pipeline parallelism.
Inference overhead: Less than 2% on typical workloads.

The AttnRes variant matches or outperforms the baseline on all evaluated benchmarks, with particularly strong gains on multi-step reasoning (GPQA-Diamond +7.5), math (Math +3.6), and code generation (HumanEval +3.1).

attention-residuals — The paper in which Kimi Linear serves as the primary evaluation platform
block-attention-residuals — The Block AttnRes variant integrated into Kimi Linear
transformer-architecture — The broader architecture family

Kimi Linear

Architecture

Training Recipe (as used in the AttnRes paper)

Integration with Attention Residuals

Related Pages