Transformer block
Combine attention, MLP, layer normalization, and residual connections into a complete transformer block.
GPT2Block is the repeating unit of GPT-2. It wires together all the
components from the previous sections: layer normalization, multi-head
attention, and the feed-forward network, connected by residual connections.
GPT-2 stacks 12 identical copies of this block. Each refines the representation produced by the previous block, building from surface-level patterns in early layers to abstract semantic understanding in later layers.
The pre-norm pattern
Each sublayer follows the same structure: normalize first, apply the sublayer, then add the original input back:
x = x + sublayer(layer_norm(x))
This is called pre-normalization. GPT-2 uses it because normalizing before each sublayer (rather than after) gives more stable gradients in deep networks—the residual connection provides a direct path for gradients to flow backward through all 12 blocks without passing through the normalization.
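The pattern can be sketched in a few lines of PyTorch. The `Linear` layer here is a hypothetical stand-in for either sublayer (attention or MLP), just to show the order of operations:

```python
import torch
from torch import nn

# One pre-norm sublayer step: normalize first, apply the sublayer,
# then add the original input back. nn.Linear is a stand-in sublayer.
layer_norm = nn.LayerNorm(768)
sublayer = nn.Linear(768, 768)

x = torch.randn(2, 5, 768)
out = x + sublayer(layer_norm(x))
```

Note that the residual term `x` is added back untouched, which is exactly what gives gradients a direct path around the normalization.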
The pattern appears twice per block:
- Attention: hidden_states = attn_output + residual (where residual is the pre-norm input)
- MLP: hidden_states = residual + feed_forward_hidden_states
The block maintains a constant 768-dimensional representation throughout. Input
shape [batch, seq_length, 768] is unchanged after each sublayer, which is
essential for stacking 12 blocks together.
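This shape invariance is easy to verify with a toy block. `ToyBlock` below is a simplified stand-in (a single `Linear` in place of attention and the MLP), not the real `GPT2Block`, but it preserves shape for the same reason: every sublayer maps 768 dimensions to 768 and adds the result to its input:

```python
import torch
from torch import nn

# Toy pre-norm block: because each sublayer maps d -> d and adds a
# residual, the output shape always matches the input shape.
class ToyBlock(nn.Module):
    def __init__(self, d: int = 768) -> None:
        super().__init__()
        self.ln_1 = nn.LayerNorm(d)
        self.fc = nn.Linear(d, d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.fc(self.ln_1(x))

# Stack several blocks; the [batch, seq_length, 768] shape never changes.
x = torch.randn(2, 10, 768)
for block in [ToyBlock() for _ in range(3)]:
    x = block(x)
```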
Component names
ln_1, attn, ln_2, and mlp match Hugging Face’s GPT-2 implementation
exactly. This naming is required for loading pretrained weights.
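The reason is that PyTorch builds `state_dict` keys from attribute names, so the keys in your module must match the keys in the pretrained checkpoint. A quick sketch with stand-in submodules (hypothetical `Linear` layers in place of the real attention and MLP) shows the key prefixes that result:

```python
from torch import nn

# state_dict keys are derived from attribute names, so ln_1/attn/ln_2/mlp
# must match Hugging Face's checkpoint keys (e.g. "h.0.ln_1.weight").
class ToyGPT2Block(nn.Module):
    def __init__(self, d: int = 768) -> None:
        super().__init__()
        self.ln_1 = nn.LayerNorm(d)
        self.attn = nn.Linear(d, d)  # stand-in for GPT2MultiHeadAttention
        self.ln_2 = nn.LayerNorm(d)
        self.mlp = nn.Linear(d, d)   # stand-in for GPT2MLP

# Collect the top-level prefixes of every state_dict key.
names = {key.split(".")[0] for key in ToyGPT2Block().state_dict()}
```

With matching names, `load_state_dict` can consume the pretrained weights directly; with any other names, every key would fail to match.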
The code
class GPT2Block(Module):  # type: ignore[type-arg]
    """Exact HuggingFace GPT-2 transformer block structure"""

    def __init__(self, config: GPT2Config) -> None:
        super().__init__()
        hidden_size = config.n_embd
        # The MLP expands to 4x the hidden size unless n_inner overrides it
        inner_dim = (
            config.n_inner
            if hasattr(config, "n_inner") and config.n_inner is not None
            else 4 * hidden_size
        )
        self.ln_1 = LayerNorm(hidden_size, eps=config.layer_norm_epsilon)
        self.attn = GPT2MultiHeadAttention(config)
        self.ln_2 = LayerNorm(hidden_size, eps=config.layer_norm_epsilon)
        self.mlp = GPT2MLP(inner_dim, config)

    def forward(self, hidden_states: Tensor) -> Tensor:
        # Attention sublayer: pre-norm, attend, add the residual back
        residual = hidden_states
        hidden_states = self.ln_1(hidden_states)
        attn_output = self.attn(hidden_states)
        hidden_states = attn_output + residual

        # MLP sublayer: pre-norm, feed forward, add the residual back
        residual = hidden_states
        hidden_states = self.ln_2(hidden_states)
        feed_forward_hidden_states = self.mlp(hidden_states)
        hidden_states = residual + feed_forward_hidden_states
        return hidden_states
Next: Section 7 stacks 12 of these blocks with embeddings to create the main body of the GPT-2 model.