Step 01: Model configuration

Learn to define the GPT-2 model architecture parameters using configuration classes.

Defining the model architecture

Before you can implement GPT-2, you need to define its architecture: the dimensions, layer counts, and structural parameters that determine how the model processes information.

In this step, you’ll create GPT2Config, a class that holds all the architectural decisions for GPT-2. It describes things such as the embedding dimension, the number of transformer layers, and the number of attention heads. These parameters define the shape and capacity of your model.

OpenAI trained the original GPT-2 model with specific parameters that you can see in the config.json file on Hugging Face. By using the exact same values, we can access OpenAI’s pretrained weights in subsequent steps.

Understanding the parameters

Looking at the config.json file, we can see some key information about the model. Each parameter controls a different aspect of the model’s architecture:

  • vocab_size: Size of the token vocabulary (default: 50,257). This seemingly odd number is actually 50,000 Byte Pair Encoding (BPE) tokens + 256 byte-level tokens (fallback for rare characters) + 1 special token.
  • n_positions: Maximum sequence length, also called the context window (default: 1,024). Attention memory grows quadratically with sequence length, which is why this is bounded.
  • n_embd: Embedding dimension, or the size of the hidden states that flow through the model (default: 768). This determines the model’s capacity to represent information.
  • n_layer: Number of transformer blocks stacked vertically (default: 12). More layers allow the model to learn more complex patterns.
  • n_head: Number of attention heads per layer (default: 12). Multiple heads let the model attend to different types of patterns simultaneously.
  • n_inner: Dimension of the MLP intermediate layer (default: 3,072). This is 4x the embedding dimension, a ratio that goes back to the Attention Is All You Need paper and has empirically been found to work well (see the sketch after this list).
  • layer_norm_epsilon: Small constant for numerical stability in layer normalization (default: 1e-5). This prevents division by zero when variance is very small.
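
The parameters are also related to one another: each attention head works on n_embd / n_head dimensions, and n_inner is 4 * n_embd. The back-of-the-envelope sketch below (not part of the exercise; biases and LayerNorm weights are ignored) shows how these values combine into roughly the commonly cited ~124M parameter count of GPT-2 small.

# Rough sketch of quantities derived from the GPT-2 small hyperparameters.
n_embd, n_head, n_layer = 768, 12, 12
vocab_size, n_positions = 50257, 1024
n_inner = 4 * n_embd                                    # 3072, the 4x MLP expansion

head_dim = n_embd // n_head                             # 64 dimensions per attention head
embeddings = (vocab_size + n_positions) * n_embd        # token + position embeddings
per_block = 4 * n_embd * n_embd + 2 * n_embd * n_inner  # attention + MLP weight matrices
total = embeddings + n_layer * per_block

print(head_dim)                                         # 64
print(f"~{total / 1e6:.0f}M parameters")                # ~124M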

These values define the small GPT-2 model. OpenAI released four sizes (small, medium, large, XL), each with a configuration that scales up these parameters. This implementation uses the small model’s values.

Implementing the configuration

Now it’s time to implement this yourself. You’ll create the GPT2Config class using Python’s @dataclass decorator, which cuts down on boilerplate.

Instead of writing __init__ and defining each parameter manually, you just declare the fields with type hints and default values.
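
To see what the decorator buys you, here is a toy example (the Point class is purely illustrative and not part of the project):

from dataclasses import dataclass


# Without @dataclass you would write __init__ (and usually __repr__ and __eq__) by hand.
# With the decorator, declaring the fields is enough; those methods are generated for you.
@dataclass
class Point:
    x: int = 0
    y: int = 0


print(Point())        # Point(x=0, y=0)
print(Point(3, y=4))  # Point(x=3, y=4)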

First, you’ll need to import the dataclass decorator from the dataclasses module. Then you’ll add the @dataclass decorator to the GPT2Config class definition.

The actual parameter values come from Hugging Face. You can get them in two ways:

  • Option 1: Run pixi run huggingface to access these parameters programmatically from the Hugging Face transformers library (a sketch of this approach follows the list).
  • Option 2: Read the values directly from the GPT-2 model card.
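
If you go with Option 1, the script loads the published configuration through the transformers library. A minimal sketch of that approach (an assumption; the project’s actual script may differ) looks like this:

from transformers import AutoConfig

# Load the configuration published for GPT-2 small on Hugging Face.
hf_config = AutoConfig.from_pretrained("openai-community/gpt2")

print(hf_config.vocab_size, hf_config.n_positions, hf_config.n_embd)
print(hf_config.n_layer, hf_config.n_head, hf_config.layer_norm_epsilon)
print(hf_config.n_inner)  # Hugging Face stores None here, meaning 4 * n_embd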

Once you have the values, replace each None in the GPT2Config class properties with the correct numbers from the configuration.

Implementation (step_01.py):

"""
Step 01: Model Configuration

Implement the GPT-2 configuration dataclass that stores model hyperparameters.

Tasks:
1. Import dataclass from the dataclasses module
2. Add the @dataclass decorator to the GPT2Config class
3. Fill in the configuration values from HuggingFace GPT-2 model

Run: pixi run s01
"""

# 1. Import dataclass from the dataclasses module

# 2. Add the Python @dataclass decorator to the GPT2Config class


class GPT2Config:
    """GPT-2 configuration matching HuggingFace.

    Attributes:
        vocab_size: Size of the vocabulary.
        n_positions: Maximum sequence length.
        n_embd: Embedding dimension.
        n_layer: Number of transformer layers.
        n_head: Number of attention heads.
        n_inner: Inner dimension of feed-forward network (defaults to 4 * n_embd if None).
        layer_norm_epsilon: Epsilon for layer normalization.
    """

    # 3a. Run `pixi run huggingface` to access the model parameters from the Hugging Face `transformers` library
    # 3b. Alternatively, read the values from the GPT-2 model card: https://huggingface.co/openai-community/gpt2/blob/main/config.json
    # 4. Replace each None in the GPT2Config fields with the correct value
    vocab_size: int = None
    n_positions: int = None
    n_embd: int = None
    n_layer: int = None
    n_head: int = None
    n_inner: int = None  # Equal to 4 * n_embd
    layer_norm_epsilon: float = None

Validation

Run pixi run s01 to verify your implementation matches the expected configuration.
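
The project’s test harness performs the actual check; a rough sketch of an equivalent check (assuming your class lives in step_01.py and the transformers library is installed) compares your dataclass against the published configuration:

from transformers import AutoConfig

from step_01 import GPT2Config  # assumes step_01.py is importable from the current directory

hf = AutoConfig.from_pretrained("openai-community/gpt2")
mine = GPT2Config()

assert mine.vocab_size == hf.vocab_size
assert mine.n_positions == hf.n_positions
assert mine.n_embd == hf.n_embd
assert mine.n_layer == hf.n_layer
assert mine.n_head == hf.n_head
assert mine.n_inner == (hf.n_inner or 4 * hf.n_embd)  # Hugging Face leaves n_inner as None for the 4x default
assert mine.layer_norm_epsilon == hf.layer_norm_epsilon
print("Configuration matches GPT-2 small.")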

Solution
"""
Solution for Step 01: Model Configuration

This module implements the GPT-2 configuration dataclass that stores
hyperparameters matching HuggingFace's GPT-2 model structure.
"""

from dataclasses import dataclass


@dataclass
class GPT2Config:
    """GPT-2 configuration matching HuggingFace.

    Attributes:
        vocab_size: Size of the vocabulary.
        n_positions: Maximum sequence length.
        n_embd: Embedding dimension.
        n_layer: Number of transformer layers.
        n_head: Number of attention heads.
        n_inner: Inner dimension of feed-forward network (defaults to 4 * n_embd if None).
        layer_norm_epsilon: Epsilon for layer normalization.
    """

    vocab_size: int = 50257
    n_positions: int = 1024
    n_embd: int = 768
    n_layer: int = 12
    n_head: int = 12
    n_inner: int = 3072
    layer_norm_epsilon: float = 1e-5
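
Once the values are filled in, the dataclass can be instantiated with its defaults or with individual fields overridden, which is handy for quick experiments (a usage sketch, not part of the graded step):

config = GPT2Config()
print(config)                          # GPT2Config(vocab_size=50257, n_positions=1024, ...)
print(config.n_embd // config.n_head)  # 64 dimensions per attention head

# Override individual fields to get a tiny model for debugging (hypothetical sizes):
tiny = GPT2Config(n_layer=2, n_embd=128, n_head=4, n_inner=512)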

Next: In Step 02, you’ll implement causal masking to prevent tokens from attending to future positions in autoregressive generation.