Transformer block

Combine attention, MLP, layer normalization, and residual connections into a complete transformer block.

GPT2Block is the repeating unit of GPT-2. It wires together all the components from the previous sections: layer normalization, multi-head attention, and the feed-forward network, connected by residual connections.

The base GPT-2 model stacks 12 copies of this block, identical in structure but each with its own weights. Each block refines the representation produced by the previous one; early layers tend to capture surface-level patterns, and later layers more abstract, semantic features.

The pre-norm pattern

Each sublayer follows the same structure: normalize first, apply the sublayer, then add the original input back:

x = x + sublayer(layer_norm(x))

This is called pre-normalization. GPT-2 uses it because normalizing before each sublayer (rather than after) gives more stable gradients in deep networks. Because the input is added back unchanged, the residual connection gives gradients a direct path backward through all 12 blocks that bypasses both the normalization and the sublayer.
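To make the "direct path" concrete, here is a minimal pure-Python sketch on a single vector. The helper names (layer_norm, pre_norm_step, post_norm_step, zero_sublayer) are illustrative and not part of the tutorial's code, and the normalization omits the learned scale and shift. With a sublayer that contributes nothing, the pre-norm step returns its input untouched, while the post-norm variant (shown only for contrast) still rescales it.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def pre_norm_step(x, sublayer):
    """x + sublayer(layer_norm(x)): the residual branch skips the norm entirely."""
    update = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, update)]


def post_norm_step(x, sublayer):
    """layer_norm(x + sublayer(x)): the norm sits on the residual path."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def zero_sublayer(v):
    """A sublayer that contributes nothing, to isolate the residual path."""
    return [0.0] * len(v)


x = [1.0, 2.0, 4.0, 8.0]
pre_out = pre_norm_step(x, zero_sublayer)
post_out = post_norm_step(x, zero_sublayer)

print(pre_out)   # the input passes through unchanged
print(post_out)  # the input has been rescaled by the norm
```

The same reasoning applies to gradients: in the pre-norm form, the identity branch carries them back without any rescaling.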

The pattern happens twice per block:

Attention: hidden_states = attn_output + residual, where residual is the block input saved before ln_1.
MLP: hidden_states = residual + feed_forward_hidden_states, where residual is the output of the attention pass saved before ln_2.

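The block's input and output stay 768-dimensional throughout. The MLP widens to inner_dim internally (4 × 768 = 3072 by default) but projects back before its residual add, so the shape [batch, seq_length, 768] is unchanged after each sublayer. That invariant is what allows 12 blocks to be stacked.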
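The shape invariant can be checked with a rough NumPy sketch. Everything here is a stand-in: toy_block uses a plain linear map in place of real attention, a ReLU in place of GPT-2's activation, and freshly drawn random weights on every call. None of these names come from the tutorial's code; the point is only that the two residual adds force the output shape to match the input.

```python
import numpy as np

HIDDEN = 768          # model width from the text
INNER = 4 * HIDDEN    # MLP expansion width (the 4x default)
rng = np.random.default_rng(0)


def layer_norm(x, eps=1e-5):
    """Normalize over the last axis (no learned scale/shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def random_weights(rows, cols):
    """Small random float32 matrix, standing in for trained weights."""
    return 0.02 * rng.standard_normal((rows, cols), dtype=np.float32)


def toy_block(x):
    """One pre-norm block with fresh weights; NOT real attention."""
    w_mix = random_weights(HIDDEN, HIDDEN)    # stand-in for the attention sublayer
    w_up = random_weights(HIDDEN, INNER)      # MLP expansion
    w_down = random_weights(INNER, HIDDEN)    # MLP projection back to HIDDEN

    x = x + layer_norm(x) @ w_mix                    # pass 1: "attention" + residual
    hidden = np.maximum(layer_norm(x) @ w_up, 0.0)   # widen to INNER, ReLU stand-in
    x = x + hidden @ w_down                          # pass 2: MLP + residual
    return x


x = rng.standard_normal((2, 5, HIDDEN), dtype=np.float32)  # [batch, seq_length, 768]
out = x
for _ in range(12):          # 12 blocks, each call draws its own weights
    out = toy_block(out)

print(x.shape, out.shape)
```

If either sublayer returned a different width, the residual add would fail, so the constant shape is enforced by the structure itself.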

Component names

The attribute names ln_1, attn, ln_2, and mlp match Hugging Face’s GPT-2 implementation exactly. Pretrained checkpoints key their weights by these names, so renaming an attribute would leave its weights unmatched when loading.
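A small sketch of why the names matter, assuming checkpoint keys of the form h.<layer>.<component>.<parameter>. The key subset below is illustrative rather than a complete checkpoint listing, and missing_components is a hypothetical helper, not a library function.

```python
# Illustrative subset of checkpoint keys for block 0 (not a full listing).
checkpoint_keys = {
    "h.0.ln_1.weight",
    "h.0.ln_1.bias",
    "h.0.attn.c_attn.weight",
    "h.0.ln_2.weight",
    "h.0.ln_2.bias",
    "h.0.mlp.c_fc.weight",
}


def missing_components(attr_names, keys, layer=0):
    """Return attribute names that no checkpoint key refers to for this layer."""
    prefix = f"h.{layer}."
    # The component name is the third dot-separated field, e.g. "ln_1".
    found = {key.split(".")[2] for key in keys if key.startswith(prefix)}
    return sorted(set(attr_names) - found)


matching = missing_components(["ln_1", "attn", "ln_2", "mlp"], checkpoint_keys)
renamed = missing_components(["norm1", "attn", "norm2", "mlp"], checkpoint_keys)

print(matching)  # []
print(renamed)   # ['norm1', 'norm2']
```

With the original names every component finds its weights; with renamed attributes the two normalization layers would be left at their initial values.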

GPT2Block

GPT2Block wires the four components (ln_1, attn, ln_2, mlp) with pre-norm and residual connections in two passes:

class GPT2Block(Module):  # type: ignore[type-arg]
    """Exact HuggingFace GPT-2 transformer block structure"""

    def __init__(self, config: GPT2Config) -> None:
        hidden_size = config.n_embd
        inner_dim = config.n_inner if config.n_inner is not None else 4 * hidden_size

        self.ln_1 = LayerNorm(hidden_size, eps=config.layer_norm_epsilon)
        self.attn = GPT2MultiHeadAttention(config)
        self.ln_2 = LayerNorm(hidden_size, eps=config.layer_norm_epsilon)
        self.mlp = GPT2MLP(inner_dim, config)

    def forward(self, hidden_states: Tensor) -> Tensor:
        # Pass 1: pre-norm attention with a residual connection.
        residual = hidden_states
        hidden_states = self.ln_1(hidden_states)
        attn_output = self.attn(hidden_states)
        hidden_states = attn_output + residual

        # Pass 2: pre-norm MLP with a residual connection.
        residual = hidden_states
        hidden_states = self.ln_2(hidden_states)
        feed_forward_hidden_states = self.mlp(hidden_states)
        hidden_states = residual + feed_forward_hidden_states

        # Same shape as the input: [batch, seq_length, hidden_size].
        return hidden_states

The block reads input at 768 dimensions, normalizes and applies attention with a residual, then normalizes and applies the MLP with another residual. Input and output shapes are identical, which is what makes stacking 12 of them possible.
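One detail of the constructor worth isolating is how inner_dim is resolved: an explicit n_inner wins, otherwise the MLP width defaults to four times the hidden size. The sketch below uses ToyConfig, a hypothetical stand-in for GPT2Config that carries only the two fields involved; its default of n_inner=None is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyConfig:
    """Hypothetical stand-in for GPT2Config with only the fields used here."""
    n_embd: int = 768
    n_inner: Optional[int] = None  # assumed default: fall back to 4 * n_embd


def resolve_inner_dim(config: ToyConfig) -> int:
    """Mirror the constructor's rule: explicit n_inner, else 4x the hidden size."""
    return config.n_inner if config.n_inner is not None else 4 * config.n_embd


default_dim = resolve_inner_dim(ToyConfig())
custom_dim = resolve_inner_dim(ToyConfig(n_inner=1024))

print(default_dim)  # 3072
print(custom_dim)   # 1024
```

Whatever inner_dim resolves to, it only affects the MLP's internal width; the block's input and output remain at n_embd.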

Next: "Stack transformer blocks" combines 12 of these blocks with embeddings to create the main body of the GPT-2 model.