Step 01: Model configuration
Learn to define the GPT-2 model architecture parameters using configuration classes.
Defining the model architecture
Before you can implement GPT-2, you need to define its architecture: the dimensions, layer counts, and structural parameters that determine how the model processes information.
In this step, you’ll create GPT2Config, a class that holds all the architectural decisions for GPT-2: the embedding dimension, the number of transformer layers, the number of attention heads, and so on. These parameters define the shape and capacity of your model.
OpenAI trained the original GPT-2 model with specific parameter values, which you can see in the config.json file on Hugging Face. By using exactly the same values, you’ll be able to load OpenAI’s pretrained weights in later steps.
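If you’d like to peek at those published values right away, here is a minimal sketch that prints them. It assumes the Hugging Face `transformers` library is installed and is not the project’s own tooling (that comes later in this step):
from transformers import AutoConfig

# Download (or load from cache) the published GPT-2 configuration
cfg = AutoConfig.from_pretrained("openai-community/gpt2")

for name in ("vocab_size", "n_positions", "n_embd", "n_layer",
             "n_head", "n_inner", "layer_norm_epsilon"):
    # Note: n_inner may be stored as None upstream, meaning "use 4 * n_embd"
    print(name, "=", getattr(cfg, name))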
Understanding the parameters
Looking at the config.json file, we can see some key information about the model. Each parameter controls a different aspect of the model’s architecture:
- vocab_size: Size of the token vocabulary (default: 50,257). This seemingly odd number is actually 50,000 Byte Pair Encoding (BPE) tokens + 256 byte-level tokens (a fallback for rare characters) + 1 special token.
- n_positions: Maximum sequence length, also called the context window (default: 1,024). Longer sequences require quadratically more memory in attention.
- n_embd: Embedding dimension, or the size of the hidden states that flow through the model (default: 768). This determines the model’s capacity to represent information.
- n_layer: Number of transformer blocks stacked vertically (default: 12). More layers allow the model to learn more complex patterns.
- n_head: Number of attention heads per layer (default: 12). Multiple heads let the model attend to different types of patterns simultaneously.
- n_inner: Dimension of the MLP intermediate layer (default: 3,072). This is 4x the embedding dimension, a ratio the “Attention Is All You Need” paper found empirically to work well.
- layer_norm_epsilon: Small constant for numerical stability in layer normalization (default: 1e-5). This prevents division by zero when the variance is very small.
These values define the small GPT-2 model. OpenAI released four sizes (small, medium, large, XL), each scaling these parameters up. This implementation uses the small model’s values throughout.
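Notice that several of these numbers are related. The quick sanity check below is just arithmetic on the values listed above; run it (or do the math in your head) to convince yourself they line up:
vocab_size = 50_000 + 256 + 1        # BPE merges + byte-level fallbacks + 1 special token
assert vocab_size == 50257

n_embd, n_head, n_inner = 768, 12, 3072
assert n_inner == 4 * n_embd         # the MLP widens the hidden state by 4x
assert n_embd % n_head == 0          # the embedding splits evenly across heads
print(n_embd // n_head)              # 64 dimensions per attention head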
Implementing the configuration
Now it’s your turn to implement this. You’ll create the GPT2Config class using Python’s @dataclass decorator, which cuts down on boilerplate: instead of writing __init__ and assigning each parameter manually, you just declare the fields with type hints and default values.
First, you’ll need to import the dataclass decorator from the dataclasses module. Then you’ll add the @dataclass decorator to the GPT2Config class definition.
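To see that pattern in isolation, here is a throwaway example; the class names are made up for illustration and are not part of the exercise:
from dataclasses import dataclass

# Without the decorator you write the constructor yourself:
class ManualConfig:
    def __init__(self, width: int = 8, depth: int = 2):
        self.width = width
        self.depth = depth

# With @dataclass, the annotated fields with defaults are the whole definition;
# __init__, __repr__, and __eq__ are generated for you.
@dataclass
class DataclassConfig:
    width: int = 8
    depth: int = 2

print(DataclassConfig())           # DataclassConfig(width=8, depth=2)
print(DataclassConfig(width=16))   # individual defaults can be overridden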
The actual parameter values come from Hugging Face. You can get them in two ways:
- Option 1: Run `pixi run huggingface` to access these parameters programmatically via the Hugging Face `transformers` library.
- Option 2: Read the values directly from the GPT-2 model card.
Once you have the values, replace each None in the GPT2Config class properties with the correct numbers from the configuration.
Implementation (step_01.py):
"""
Step 01: Model Configuration
Implement the GPT-2 configuration dataclass that stores model hyperparameters.
Tasks:
1. Import dataclass from the dataclasses module
2. Add the @dataclass decorator to the GPT2Config class
3. Fill in the configuration values from HuggingFace GPT-2 model
Run: pixi run s01
"""
# 1. Import dataclass from the dataclasses module
# 2. Add the Python @dataclass decorator to the GPT2Config class
class GPT2Config:
"""GPT-2 configuration matching HuggingFace.
Attributes:
vocab_size: Size of the vocabulary.
n_positions: Maximum sequence length.
n_embd: Embedding dimension.
n_layer: Number of transformer layers.
n_head: Number of attention heads.
n_inner: Inner dimension of feed-forward network (defaults to 4 * n_embd if None).
layer_norm_epsilon: Epsilon for layer normalization.
"""
# 3a. Run `pixi run huggingface` to access the model parameters from the Hugging Face `transformers` library
# 3b. Alternately, read the values from GPT-2 model card: https://huggingface.co/openai-community/gpt2/blob/main/config.json
# 4. Replace the None of the GPT2Config properties with the correct values
vocab_size: int = None
n_positions: int = None
n_embd: int = None
n_layer: int = None
n_head: int = None
n_inner: int = None # Equal to 4 * n_embd
layer_norm_epsilon: float = None
Validation
Run `pixi run s01` to verify your implementation matches the expected configuration.
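If you want an extra, manual cross-check against the published configuration, a sketch like the following works too. It assumes the `transformers` library is installed and that your implementation is importable as step_01 (adjust the import to match your layout):
from transformers import AutoConfig
from step_01 import GPT2Config  # hypothetical import path; adjust to your setup

reference = AutoConfig.from_pretrained("openai-community/gpt2")
mine = GPT2Config()

# n_inner is skipped: upstream may store it as None, meaning "use 4 * n_embd"
for name in ("vocab_size", "n_positions", "n_embd", "n_layer",
             "n_head", "layer_norm_epsilon"):
    assert getattr(mine, name) == getattr(reference, name), name
print("Configuration matches the published GPT-2 values.")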
Solution
"""
Solution for Step 01: Model Configuration
This module implements the GPT-2 configuration dataclass that stores
hyperparameters matching HuggingFace's GPT-2 model structure.
"""
from dataclasses import dataclass
@dataclass
class GPT2Config:
"""GPT-2 configuration matching HuggingFace.
Attributes:
vocab_size: Size of the vocabulary.
n_positions: Maximum sequence length.
n_embd: Embedding dimension.
n_layer: Number of transformer layers.
n_head: Number of attention heads.
n_inner: Inner dimension of feed-forward network (defaults to 4 * n_embd if None).
layer_norm_epsilon: Epsilon for layer normalization.
"""
vocab_size: int = 50257
n_positions: int = 1024
n_embd: int = 768
n_layer: int = 12
n_head: int = 12
n_inner: int = 3072
layer_norm_epsilon: float = 1e-5
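Because it’s a dataclass, the configuration is just a plain object: instantiate it with the published defaults, or override individual fields. The tiny variant below is only an illustration, not something later steps require:
config = GPT2Config()
print(config)                           # GPT2Config(vocab_size=50257, n_positions=1024, ...)
print(config.n_embd // config.n_head)   # 64 dimensions per attention head

# Hypothetical smaller variant for quick experiments
debug_config = GPT2Config(n_layer=2, n_positions=128)
print(debug_config)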
Next: In Step 02, you’ll implement causal masking to prevent tokens from attending to future positions in autoregressive generation.