<aside> ✨

Abstract

Modern large language models stack identical transformer blocks at every depth, despite growing evidence that shallow and deep layers serve distinct computational roles. We ask a simple question: should every layer have the same architecture? Starting from a dense transformer baseline, we study two per-layer modifications: (1) Mixture-of-Experts (MoE), which adds capacity without increasing per-token computation, and (2) Cross-Layer Weight Sharing (CLWS), which reduces stored parameters while preserving the same forward pass. Through systematic placement sweeps at 286M and 1.68B parameters, we find that MoE yields the largest gains at later layers, while weight sharing causes the least degradation at earlier layers. These complementary findings motivate Stratiformer, a heterogeneous architecture that allocates MoE to late layers and weight sharing to early layers. In controlled experiments, Stratiformer matches the dense baseline in both active parameters and FLOPs per token, while achieving lower validation bits-per-byte and higher downstream task accuracy.

📝 This is a research preview. Additional results will be released soon in a preprint.

https://github.com/Tim-Siu/nanochat/tree/dev-hetero

🌐 This page is also available at https://www.notion.so/Stratiformer-30c622b08b6780eb9976cc8453d55fa0

</aside>

1. Introduction

The dominant recipe for building large language models is remarkably uniform: stack $L$ identical transformer blocks [1], each consisting of a self-attention sublayer followed by a feed-forward network (FFN). Recent frontier models such as Qwen3 [2] and DeepSeek-V3 [3] follow this template, varying primarily in whether the FFN is dense or a Mixture-of-Experts (MoE).

Yet a growing body of evidence suggests that not all layers play the same role. Mechanistic interpretability studies find that early layers tend to build contextual representations and retrieve stored knowledge, while later layers perform the composition and refinement that most directly shape predictions [4, 5]. Representation similarity analyses confirm that adjacent late layers are more redundant than adjacent early layers [6]. If layers differ in function, it stands to reason they might also differ in optimal capacity.

We investigate this idea through two complementary interventions applied at varying depths:

  1. Mixture-of-Experts (MoE) layers replace the dense FFN with a routed expert mixture, increasing total stored parameters while keeping per-token active parameters and FLOPs constant.
  2. Cross-Layer Weight Sharing (CLWS) ties the FFN weights of consecutive layers, reducing unique stored parameters while preserving per-token computation.

MoE adds capacity; weight sharing removes it. By sweeping the position of each intervention across the full model depth, we can map where extra capacity helps most and where reduced capacity hurts least.
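To make the MoE intervention concrete, here is a minimal numpy sketch of a top-k routed FFN. All names and sizes are hypothetical (not the paper's configuration): with `top_k = 1`, each token activates one expert-sized MLP, so active parameters and FLOPs per token stay close to a dense FFN even though total stored parameters grow with the number of experts.

```python
import numpy as np

# Illustrative top-k routed MoE FFN (hypothetical sizes, not the paper's config).
rng = np.random.default_rng(0)
d_model, d_ff, n_experts, top_k = 8, 16, 4, 1

# Each expert is a small 2-layer ReLU MLP; only top_k of n_experts run per token.
experts = [(rng.standard_normal((d_model, d_ff)) * 0.02,
            rng.standard_normal((d_ff, d_model)) * 0.02)
           for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_ffn(x):
    """Route one token x of shape (d_model,) to its top-k experts."""
    logits = x @ router
    top = np.argsort(logits)[-top_k:]                        # chosen expert indices
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over chosen
    out = np.zeros(d_model)
    for g, e in zip(gates, top):
        w_in, w_out = experts[e]
        out += g * (np.maximum(x @ w_in, 0.0) @ w_out)       # gated expert output
    return out

y = moe_ffn(rng.standard_normal(d_model))
```

Total FFN parameters here are `n_experts` times the dense count, but per token only `top_k` expert MLPs (plus the small router) are evaluated.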

Our experiments at two model scales (286M and 1.68B parameters) reveal a consistent pattern: extra capacity from MoE helps most when placed at late layers, while reduced capacity from weight sharing hurts least when placed at early layers.

Figure 1: Overview of the Stratiformer architecture. Left: a standard dense transformer stacks $L$ identical blocks. Right: Stratiformer stratifies the network by depth — early layers share FFN weights (CLWS, blue), middle layers remain dense (gray), and late layers use Mixture-of-Experts (MoE, red). Active parameters and FLOPs per token are identical to the dense baseline.


Guided by these findings, we propose Stratiformer — a stratified transformer that pairs MoE at late layers with weight sharing at early layers (Figure 1). The extra parameters introduced by MoE are offset by the savings from weight sharing, so Stratiformer matches the dense baseline in active parameters, FLOPs per token, and approximate total parameter count, while achieving better validation loss and downstream accuracy.
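The parameter bookkeeping behind this offset can be sketched in a few lines of Python. The layer counts and expert counts below are illustrative assumptions, not the paper's actual configurations; they are chosen so that the FFN parameters added by MoE exactly cancel the parameters saved by weight sharing.

```python
# Hypothetical parameter accounting for a Stratiformer-style layout
# (illustrative sizes only, not the paper's configs).
d_model, d_ff, n_layers = 512, 2048, 12
ffn_params = 2 * d_model * d_ff      # up- and down-projection of one dense FFN

n_shared_pairs = 2                   # early layers: pairs of layers tie FFN weights
n_moe_layers = 2                     # late layers: dense FFN -> routed experts
n_experts, top_k = 2, 1              # chosen so added == saved below

dense_total = n_layers * ffn_params
saved = n_shared_pairs * ffn_params                  # each tied pair stores one FFN, not two
added = n_moe_layers * (n_experts - 1) * ffn_params  # extra experts at MoE layers
strat_total = dense_total - saved + added

# Active (per-token) FFN parameters are unchanged: every layer still runs one
# FFN-sized computation (top_k experts at MoE layers, the tied copy at CLWS layers).
active_dense = n_layers * ffn_params
active_strat = ((n_layers - n_moe_layers) * ffn_params
                + n_moe_layers * top_k * ffn_params)
```

With this balance, `strat_total == dense_total` and `active_strat == active_dense`, mirroring the parity claims above; in practice the cancellation need only be approximate.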

2. Methodology

2.1 Dense Transformer Baseline

Our baseline is a decoder-only transformer [1] following the NanoChat recipe [24], which incorporates several modern design choices: