Model configuration
Define the GPT-2 model architecture parameters using a configuration class.
These chapters implement gpt2_arch/gpt2.py, the model class that max serve
compiles when you run the architecture package. The first thing that file needs
is a configuration object.
GPT2Config holds all the architectural decisions for GPT-2: embedding
dimensions, number of transformer layers, number of attention heads. These
parameters define the shape and capacity of the model.
OpenAI trained the original GPT-2 model with specific parameters available in the config.json file on Hugging Face. Using the exact same values lets us load OpenAI’s pretrained weights in the final step.
The configuration parameters
Each field controls a different aspect of the model:
vocab_size: Size of the token vocabulary (50,257). This number is 50,000 Byte Pair Encoding tokens + 256 byte-level tokens (fallback for rare characters) + 1 special token.
n_positions: Maximum sequence length, also called the context window (1,024). Longer sequences require quadratic memory in attention.
n_embd: Embedding dimension, the size of the hidden states that flow through the model (768). This determines the model’s capacity to represent information.
n_layer: Number of transformer blocks stacked vertically (12). More layers allow the model to learn more complex patterns.
n_head: Number of attention heads per layer (12). Multiple heads let the model attend to different types of patterns simultaneously.
n_inner: Dimension of the MLP intermediate layer (optional; the transformer block defaults to 4× embedding when None). The 4× ratio comes from the original Attention Is All You Need paper.
layer_norm_epsilon: Small constant for numerical stability in layer normalization (1e-5). Prevents division by zero when variance is very small.
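The arithmetic behind a few of these values is easy to check by hand. The sketch below is only an illustration: the per-head dimension assumes the usual multi-head layout in which the embedding is split evenly across heads, and attention_score_entries is a hypothetical helper, not part of gpt2.py.

```python
# Values taken from the parameter list above.
bpe_tokens = 50_000
byte_tokens = 256
special_tokens = 1
vocab_size = bpe_tokens + byte_tokens + special_tokens

n_embd = 768
n_head = 12
# Assumption: the embedding is split evenly across the attention heads.
head_dim = n_embd // n_head


def attention_score_entries(seq_len: int) -> int:
    """Entries in one head's seq_len x seq_len attention score matrix."""
    return seq_len * seq_len


print(vocab_size)  # 50257
print(head_dim)    # 64
# Doubling the sequence length quadruples the score matrix.
print(attention_score_entries(1024) // attention_score_entries(512))  # 4
```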
These values define the small GPT-2 model. OpenAI released four sizes (small, medium, large, XL), each scaling these parameters up.
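For reference, here is a rough sketch of how the three main parameters scale across the four sizes. The numbers are the ones published in the Hugging Face configs for gpt2, gpt2-medium, gpt2-large, and gpt2-xl; it is worth checking them against each config.json before relying on them. SizeSpec is a hypothetical helper used only for this comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SizeSpec:
    """Hypothetical container for the parameters that change between sizes."""

    n_layer: int
    n_embd: int
    n_head: int


GPT2_SIZES = {
    "small": SizeSpec(n_layer=12, n_embd=768, n_head=12),
    "medium": SizeSpec(n_layer=24, n_embd=1024, n_head=16),
    "large": SizeSpec(n_layer=36, n_embd=1280, n_head=20),
    "xl": SizeSpec(n_layer=48, n_embd=1600, n_head=25),
}

for name, spec in GPT2_SIZES.items():
    # Assumes the embedding is split evenly across heads.
    head_dim = spec.n_embd // spec.n_head
    print(f"{name}: layers={spec.n_layer} n_embd={spec.n_embd} "
          f"heads={spec.n_head} head_dim={head_dim}")
```

The layer count, embedding width, and head count all grow with each size, while the per-head dimension stays the same.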
GPT2Config
Python’s @dataclass decorator eliminates boilerplate. Instead of writing __init__ manually, you declare fields with type hints and default values:
from dataclasses import dataclass


@dataclass
class GPT2Config:
    """GPT-2 configuration matching HuggingFace"""

    vocab_size: int = 50257
    n_positions: int = 1024
    n_embd: int = 768
    n_layer: int = 12
    n_head: int = 12
    n_inner: int | None = None
    layer_norm_epsilon: float = 1e-5
The n_inner: int | None = None field is optional. When None, the
transformer block defaults to 4× the embedding dimension (3,072). This lets you
override the inner dimension for experimental architectures without changing the
other components.
GPT2ArchConfig in model_config.py reads n_head, n_embd, and n_layer
from this config at serving time to calculate KV cache dimensions.
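To see why those three fields are enough, here is a rough sketch of the sizing arithmetic. It is not GPT2ArchConfig's actual code: the function names, the tuple layout, and the float32 assumption are all illustrative, and the real serving stack may organize the cache differently.

```python
from dataclasses import dataclass


@dataclass
class GPT2Config:
    """Trimmed copy with only the fields this sketch needs."""

    n_embd: int = 768
    n_layer: int = 12
    n_head: int = 12


def kv_values_per_token(config: GPT2Config) -> tuple[int, int, int, int]:
    """Hypothetical layout: (layers, key+value, heads, head_dim)."""
    head_dim = config.n_embd // config.n_head
    # 2 = one key vector and one value vector per head, per layer.
    return (config.n_layer, 2, config.n_head, head_dim)


def kv_cache_bytes(config: GPT2Config, seq_len: int,
                   bytes_per_value: int = 4) -> int:
    """Cache size for one sequence; bytes_per_value=4 assumes float32."""
    n_layer, kv, n_head, head_dim = kv_values_per_token(config)
    return seq_len * n_layer * kv * n_head * head_dim * bytes_per_value


config = GPT2Config()
print(kv_values_per_token(config))                 # (12, 2, 12, 64)
print(kv_cache_bytes(config, seq_len=1024) / 2**20)  # 72.0 (MiB)
```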
Next: Feed-forward network implements the MLP that processes information after attention in each transformer block.