Step 04: Feed-forward network

Learn to build the feed-forward network (MLP) that processes information after attention in each transformer block.

Building the MLP

In this step, you’ll create the GPT2MLP class: a two-layer feed-forward network that appears after attention in every transformer block. The MLP expands the embedding dimension by 4× (768 → 3,072), applies GELU activation for non-linearity, then projects back to the original dimension.

While attention lets tokens communicate with each other, the MLP processes each position independently. Attention aggregates information through weighted sums (linear operations), but the MLP adds non-linearity through GELU activation. This combination allows the model to learn complex patterns beyond what linear transformations alone can capture.
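To see why the non-linearity matters, here is a small NumPy sketch (illustrative only, not MAX code) showing that two stacked linear layers with no activation between them collapse into a single linear map, while inserting a non-linearity (tanh used here as a stand-in for GELU) breaks that equivalence:

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(768)
W1 = rng.standard_normal((3072, 768)) * 0.02
W2 = rng.standard_normal((768, 3072)) * 0.02

# Two stacked linear maps with no activation in between...
two_layers = W2 @ (W1 @ x)
# ...are exactly one linear map with a combined weight matrix.
one_layer = (W2 @ W1) @ x
print(np.allclose(two_layers, one_layer))  # True

# A non-linearity between the two layers breaks this equivalence,
# which is what lets the MLP capture non-linear patterns.
with_activation = W2 @ np.tanh(W1 @ x)
print(np.allclose(with_activation, one_layer))  # False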

GPT-2 uses a 4× expansion ratio (768 to 3,072 dimensions) because this was found to work well in the original Transformer paper and has been validated across many architectures since.

Understanding the components

The MLP has three steps applied in sequence:

Expansion layer (c_fc): Projects from 768 to 3,072 dimensions using a linear layer. This expansion gives the network more capacity to process information.

GELU activation: Applies the Gaussian Error Linear Unit, a smooth non-linear function. GPT-2 uses approximate="tanh", the tanh-based approximation, instead of the exact computation. The approximation was faster when GPT-2 was originally implemented; exact GELU is fast enough today, but the approximation is kept so outputs match the model that produced the pretrained weights. A reference sketch of both forms follows below.

Projection layer (c_proj): Projects back from 3,072 to 768 dimensions using another linear layer. This returns to the embedding dimension so outputs can be added to residual connections.

The layer names c_fc (fully connected) and c_proj (projection) match Hugging Face’s GPT-2 checkpoint structure. This naming is essential for loading pretrained weights.
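For reference, the tanh approximation selected by approximate="tanh" follows the standard formula. The sketch below (plain Python, not the MAX implementation) compares it with the exact definition:

import math

def gelu_exact(x: float) -> float:
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    # Tanh approximation used by GPT-2 (what approximate="tanh" selects).
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))

print(gelu_exact(1.0), gelu_tanh(1.0))  # the two values agree to roughly 3 decimal places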

MAX operations

You’ll use the following MAX operations to complete this task:

Linear layers: Linear from max.nn.module_v3 (https://docs.modular.com/max/api/python/nn/module_v3#max.nn.module_v3.Linear)

GELU activation: F.gelu from max.experimental.functional (https://docs.modular.com/max/api/python/experimental/functional#max.experimental.functional.gelu)

Implementing the MLP

You’ll create the GPT2MLP class that chains two linear layers with GELU activation between them. The implementation is straightforward: three operations applied in sequence.

First, import the required modules. You’ll need functional as F for the GELU activation, Tensor for type hints, Linear for the layers, and Module as the base class.

In the __init__ method, create two linear layers:

  • Expansion layer: Linear(embed_dim, intermediate_size, bias=True) stored as self.c_fc
  • Projection layer: Linear(intermediate_size, embed_dim, bias=True) stored as self.c_proj

Both layers include bias terms (bias=True). The intermediate size is typically 4× the embedding dimension.

In the __call__ method (the forward pass), apply the three transformations:

  1. Expand: hidden_states = self.c_fc(hidden_states)
  2. Activate: hidden_states = F.gelu(hidden_states, approximate="tanh")
  3. Project: hidden_states = self.c_proj(hidden_states)

Return the final hidden_states. The input and output shapes are the same: [batch, seq_length, embed_dim].
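If you want to sanity-check the shapes before writing any MAX code, this framework-agnostic NumPy sketch walks the same three steps with GPT-2 small dimensions (768 and 3,072) and confirms the output shape matches the input:

import numpy as np

batch, seq_length, embed_dim, intermediate = 2, 10, 768, 3072
rng = np.random.default_rng(0)
x = rng.standard_normal((batch, seq_length, embed_dim))
W_fc = rng.standard_normal((embed_dim, intermediate)) * 0.02
b_fc = np.zeros(intermediate)
W_proj = rng.standard_normal((intermediate, embed_dim)) * 0.02
b_proj = np.zeros(embed_dim)

h = x @ W_fc + b_fc       # expand: [2, 10, 768] -> [2, 10, 3072]
h = 0.5 * h * (1 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))  # tanh GELU
h = h @ W_proj + b_proj   # project: [2, 10, 3072] -> [2, 10, 768]
print(h.shape == x.shape)  # True: the MLP preserves the input shape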

Implementation (step_04.py):

"""
Step 04: Feed-forward Network (MLP)

Implement the MLP used in each transformer block with GELU activation.

Tasks:
1. Import functional (as F), Tensor, Linear, and Module from MAX
2. Create c_fc linear layer (embedding to intermediate dimension)
3. Create c_proj linear layer (intermediate back to embedding dimension)
4. Apply c_fc transformation in forward pass
5. Apply GELU activation function
6. Apply c_proj transformation and return result

Run: pixi run s04
"""

# 1: Import the required modules from MAX
# TODO: Import functional module from max.experimental with the alias F
# https://docs.modular.com/max/api/python/experimental/functional

# TODO: Import Tensor from max.experimental.tensor
# https://docs.modular.com/max/api/python/experimental/tensor.Tensor

# TODO: Import Linear and Module from max.nn.module_v3
# https://docs.modular.com/max/api/python/nn/module_v3

from solutions.solution_01 import GPT2Config


class GPT2MLP(Module):
    """Feed-forward network matching HuggingFace GPT-2 structure.

    Args:
        intermediate_size: Size of the intermediate layer.
        config: GPT-2 configuration.
    """

    def __init__(self, intermediate_size: int, config: GPT2Config):
        super().__init__()
        embed_dim = config.n_embd

        # 2: Create the first linear layer (embedding to intermediate)
        # TODO: Create self.c_fc as a Linear layer from embed_dim to intermediate_size with bias=True
        # https://docs.modular.com/max/api/python/nn/module_v3#max.nn.module_v3.Linear
        # Hint: This is the expansion layer in the MLP
        self.c_fc = None

        # 3: Create the second linear layer (intermediate back to embedding)
        # TODO: Create self.c_proj as a Linear layer from intermediate_size to embed_dim with bias=True
        # https://docs.modular.com/max/api/python/nn/module_v3#max.nn.module_v3.Linear
        # Hint: This is the projection layer that brings us back to the embedding dimension
        self.c_proj = None

    def __call__(self, hidden_states: Tensor) -> Tensor:
        """Apply feed-forward network.

        Args:
            hidden_states: Input hidden states.

        Returns:
            MLP output.
        """
        # 4: Apply the first linear transformation
        # TODO: Apply self.c_fc to hidden_states
        # Hint: This expands the hidden dimension to the intermediate size
        hidden_states = None

        # 5: Apply GELU activation function
        # TODO: Use F.gelu() with hidden_states and approximate="tanh"
        # https://docs.modular.com/max/api/python/experimental/functional#max.experimental.functional.gelu
        # Hint: GELU is the non-linear activation used in GPT-2's MLP
        hidden_states = None

        # 6: Apply the second linear transformation and return
        # TODO: Apply self.c_proj to hidden_states and return the result
        # Hint: This projects back to the embedding dimension
        return None

Validation

Run pixi run s04 to verify your implementation.

Solution:
"""
Solution for Step 04: Feed-forward Network (MLP)

This module implements the feed-forward network (MLP) used in each
transformer block with GELU activation.
"""

from max.experimental import functional as F
from max.experimental.tensor import Tensor
from max.nn.module_v3 import Linear, Module

from solutions.solution_01 import GPT2Config


class GPT2MLP(Module):
    """Feed-forward network matching HuggingFace GPT-2 structure.

    Args:
        intermediate_size: Size of the intermediate layer.
        config: GPT-2 configuration.
    """

    def __init__(self, intermediate_size: int, config: GPT2Config):
        super().__init__()
        embed_dim = config.n_embd
        self.c_fc = Linear(embed_dim, intermediate_size, bias=True)
        self.c_proj = Linear(intermediate_size, embed_dim, bias=True)

    def __call__(self, hidden_states: Tensor) -> Tensor:
        """Apply feed-forward network.

        Args:
            hidden_states: Input hidden states.

        Returns:
            MLP output.
        """
        hidden_states = self.c_fc(hidden_states)
        hidden_states = F.gelu(hidden_states, approximate="tanh")
        hidden_states = self.c_proj(hidden_states)
        return hidden_states
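As a quick usage sketch: assuming GPT2Config() from Step 01 can be constructed with its defaults (only the n_embd field, which appears in the code above, is used here), you might instantiate the MLP like this:

from solutions.solution_01 import GPT2Config

config = GPT2Config()  # assumption: Step 01 defaults describe GPT-2 small (n_embd=768)
mlp = GPT2MLP(intermediate_size=4 * config.n_embd, config=config)
# Calling mlp(hidden_states) on a Tensor of shape [batch, seq_length, 768]
# returns a Tensor of the same shape.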

Next: In Step 05, you’ll implement token embeddings to convert discrete token IDs into continuous vector representations.