Model configuration

Define the GPT-2 model architecture parameters using a configuration class.

These chapters implement gpt2_arch/gpt2.py, the file containing the model class that max serve compiles when you run the architecture package. The first thing that file needs is a configuration object.

GPT2Config holds all the architectural decisions for GPT-2: embedding dimensions, number of transformer layers, number of attention heads. These parameters define the shape and capacity of the model.

OpenAI trained the original GPT-2 model with a specific set of values for these parameters, published in the model’s config.json file on Hugging Face. Using exactly the same values lets us load OpenAI’s pretrained weights in the final step.

The configuration parameters

Each field controls a different aspect of the model:

  • vocab_size: Size of the token vocabulary (50,257). This number is 50,000 Byte Pair Encoding merges + 256 byte-level base tokens (so any input can be encoded, including rare characters) + 1 special end-of-text token.
  • n_positions: Maximum sequence length, also called the context window (1,024). Attention memory grows quadratically with sequence length, so longer contexts are expensive.
  • n_embd: Embedding dimension, the size of the hidden states that flow through the model (768). This determines the model’s capacity to represent information.
  • n_layer: Number of transformer blocks stacked vertically (12). More layers allow the model to learn more complex patterns.
  • n_head: Number of attention heads per layer (12). Multiple heads let the model attend to different types of patterns simultaneously.
  • n_inner: Dimension of the MLP intermediate layer (optional; when None, the transformer block defaults to 4× the embedding dimension). The 4× ratio comes from the original "Attention Is All You Need" paper.
  • layer_norm_epsilon: Small constant for numerical stability in layer normalization (1e-5). Prevents division by zero when variance is very small.
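
Two quantities follow directly from these fields: the width of each attention head and the size of the attention score matrices. The sketch below is illustrative only. It repeats the fields and defaults of the GPT2Config dataclass defined in the next section, the helper names head_dim and attention_score_elements are hypothetical (they are not part of gpt2.py), and it assumes a naive attention implementation that materializes one seq_len × seq_len score matrix per head.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GPT2Config:
    # Same fields and defaults as the tutorial's config
    # (Optional[int] is equivalent to int | None).
    vocab_size: int = 50257
    n_positions: int = 1024
    n_embd: int = 768
    n_layer: int = 12
    n_head: int = 12
    n_inner: Optional[int] = None
    layer_norm_epsilon: float = 1e-5


def head_dim(config: GPT2Config) -> int:
    """Width of each attention head; n_embd must split evenly across heads."""
    if config.n_embd % config.n_head != 0:
        raise ValueError("n_embd must be divisible by n_head")
    return config.n_embd // config.n_head


def attention_score_elements(config: GPT2Config, seq_len: int) -> int:
    """Elements in one layer's scores: n_head matrices of seq_len x seq_len."""
    if seq_len > config.n_positions:
        raise ValueError("seq_len exceeds the context window")
    return config.n_head * seq_len * seq_len


config = GPT2Config()
print(head_dim(config))                        # 64
print(attention_score_elements(config, 512))   # 3145728
print(attention_score_elements(config, 1024))  # 12582912
```

Doubling the sequence length from 512 to 1,024 quadruples the number of score elements, which is the quadratic growth described for n_positions.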

These values define the small GPT-2 model. OpenAI released four sizes (small, medium, large, XL), each scaling these parameters up.
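
As a rough illustration of that scaling, the sketch below lists the layer count, embedding width, and head count for the four released sizes and estimates the parameter count of each. The per-size values are not given in this chapter; they are the ones published in the Hugging Face configs for gpt2, gpt2-medium, gpt2-large, and gpt2-xl. The GPT2Size class and approx_param_count helper are hypothetical, and the count assumes the standard GPT-2 layout: a bias on every linear layer, a 4× MLP, and an output projection that shares its weights with the token embedding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GPT2Size:
    n_layer: int
    n_embd: int
    n_head: int


# Values from the published Hugging Face configs (not from this chapter).
SIZES = {
    "small": GPT2Size(n_layer=12, n_embd=768, n_head=12),
    "medium": GPT2Size(n_layer=24, n_embd=1024, n_head=16),
    "large": GPT2Size(n_layer=36, n_embd=1280, n_head=20),
    "xl": GPT2Size(n_layer=48, n_embd=1600, n_head=25),
}

VOCAB_SIZE = 50257   # shared by all four sizes
N_POSITIONS = 1024   # shared by all four sizes


def approx_param_count(size: GPT2Size) -> int:
    """Estimate trainable parameters, assuming tied input/output embeddings."""
    d = size.n_embd
    embeddings = VOCAB_SIZE * d + N_POSITIONS * d
    attention = (d * 3 * d + 3 * d) + (d * d + d)   # QKV projection + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)     # up projection + down projection
    layer_norms = 2 * (2 * d)                       # two norms per block, scale + shift
    final_norm = 2 * d
    return embeddings + size.n_layer * (attention + mlp + layer_norms) + final_norm


for name, size in SIZES.items():
    millions = approx_param_count(size) / 1e6
    print(f"{name:>6}: {millions:.0f}M parameters, head dim {size.n_embd // size.n_head}")
```

Under these assumptions the small configuration comes to 124,439,808 parameters, and every size keeps the same per-head width of 64.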

GPT2Config

Python’s @dataclass decorator eliminates boilerplate. Instead of writing __init__ manually, you declare fields with type hints and default values:

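from dataclasses import dataclass


@dataclass
class GPT2Config:
    """GPT-2 configuration matching the Hugging Face config.json fields."""

    vocab_size: int = 50257            # token vocabulary size
    n_positions: int = 1024            # maximum sequence length
    n_embd: int = 768                  # embedding / hidden-state dimension
    n_layer: int = 12                  # number of transformer blocks
    n_head: int = 12                   # attention heads per layer
    n_inner: int | None = None         # MLP inner dimension; None means 4 * n_embd
    layer_norm_epsilon: float = 1e-5   # layer-norm numerical stability constant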
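
Because the field names match the keys in the Hugging Face config.json, a config can be built directly from that file's contents. This is a minimal sketch, assuming the JSON has already been downloaded. The config_from_dict helper is hypothetical, and the inline JSON is an illustrative excerpt rather than the complete file; the real file contains more keys, which the helper ignores.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class GPT2Config:
    # Same fields and defaults as above (Optional[int] is equivalent to int | None).
    vocab_size: int = 50257
    n_positions: int = 1024
    n_embd: int = 768
    n_layer: int = 12
    n_head: int = 12
    n_inner: Optional[int] = None
    layer_norm_epsilon: float = 1e-5


def config_from_dict(raw: dict) -> GPT2Config:
    """Keep only the keys GPT2Config declares and drop everything else."""
    known = {f.name for f in fields(GPT2Config)}
    return GPT2Config(**{k: v for k, v in raw.items() if k in known})


# Illustrative excerpt of config.json, not the complete file.
raw = json.loads("""
{
  "model_type": "gpt2",
  "vocab_size": 50257,
  "n_positions": 1024,
  "n_embd": 768,
  "n_layer": 12,
  "n_head": 12,
  "layer_norm_epsilon": 1e-05
}
""")

config = config_from_dict(raw)
print(config == GPT2Config())  # True: the defaults match the excerpt
```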


The n_inner: int | None = None field is optional. When None, the transformer block defaults to 4× the embedding dimension (3,072). This lets you override the inner dimension for experimental architectures without changing the other components.
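
The fallback can be sketched as a small helper. The name resolve_inner_dim is hypothetical; the chapter only states that the transformer block applies this default, not how it is written there.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GPT2Config:
    # Same fields and defaults as above (Optional[int] is equivalent to int | None).
    vocab_size: int = 50257
    n_positions: int = 1024
    n_embd: int = 768
    n_layer: int = 12
    n_head: int = 12
    n_inner: Optional[int] = None
    layer_norm_epsilon: float = 1e-5


def resolve_inner_dim(config: GPT2Config) -> int:
    """Return the MLP inner dimension, falling back to 4 * n_embd when unset."""
    if config.n_inner is not None:
        return config.n_inner
    return 4 * config.n_embd


print(resolve_inner_dim(GPT2Config()))              # 3072
print(resolve_inner_dim(GPT2Config(n_inner=2048)))  # 2048
```

Comparing against None explicitly, rather than relying on truthiness, keeps the override and the default clearly separated.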

GPT2ArchConfig in model_config.py reads n_head, n_embd, and n_layer from this config at serving time to calculate KV cache dimensions.
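
This chapter does not show how GPT2ArchConfig performs that calculation, so the following is only a sketch of the usual arithmetic: one key vector and one value vector per head, per layer, per token. The helper name kv_cache_bytes and the default of 4 bytes per element (float32) are assumptions, not part of model_config.py.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GPT2Config:
    # Same fields and defaults as above (Optional[int] is equivalent to int | None).
    vocab_size: int = 50257
    n_positions: int = 1024
    n_embd: int = 768
    n_layer: int = 12
    n_head: int = 12
    n_inner: Optional[int] = None
    layer_norm_epsilon: float = 1e-5


def kv_cache_bytes(
    config: GPT2Config,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_element: int = 4,  # assumption: float32 cache entries
) -> int:
    """Estimate KV cache size from n_head, n_embd, and n_layer."""
    head_dim = config.n_embd // config.n_head
    # Factor of 2: one key vector and one value vector per head.
    elements_per_token = 2 * config.n_layer * config.n_head * head_dim
    return elements_per_token * seq_len * batch_size * bytes_per_element


config = GPT2Config()
size = kv_cache_bytes(config, seq_len=config.n_positions)
print(f"{size / 2**20:.1f} MiB")  # 72.0 MiB for one full-length sequence
```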

Next: Feed-forward network implements the MLP that processes information after attention in each transformer block.