Step 04: Feed-forward network
Learn to build the feed-forward network (MLP) that processes information after attention in each transformer block.
Building the MLP
In this step, you’ll create the GPT2MLP class: a two-layer feed-forward
network that appears after attention in every transformer block. The MLP expands
the embedding dimension by 4× (768 → 3,072), applies GELU activation for
non-linearity, then projects back to the original dimension.
While attention lets tokens communicate with each other, the MLP processes each position independently. Attention aggregates information through weighted sums (linear operations), but the MLP adds non-linearity through GELU activation. This combination allows the model to learn complex patterns beyond what linear transformations alone can capture.
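To see why the non-linearity matters, consider this small NumPy sketch (an illustration only, not MAX code): two stacked linear layers with nothing between them collapse into a single linear layer, so the activation is what lets the MLP capture patterns that purely linear mixing cannot.

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))    # 4 positions, 8-dim embeddings (toy sizes)
W1 = rng.standard_normal((8, 32))  # expansion weights
W2 = rng.standard_normal((32, 8))  # projection weights

# Without an activation, the two layers are equivalent to one linear layer:
assert np.allclose((x @ W1) @ W2, x @ (W1 @ W2))

# With a non-linearity between them (GPT-2 uses GELU; tanh is shown here as a
# simple stand-in), the composition can no longer be written as a single matrix.
y = np.tanh(x @ W1) @ W2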
GPT-2 uses a 4× expansion ratio (768 to 3,072 dimensions) because this was found to work well in the original Transformer paper and has been validated across many architectures since.
Understanding the components
The MLP has three steps applied in sequence:
Expansion layer (c_fc): Projects from 768 to 3,072 dimensions using a linear layer. This expansion gives the network more capacity to process information.
GELU activation: Applies the Gaussian Error Linear Unit, a smooth non-linear
function. GPT-2 uses the tanh-based approximation (approximate="tanh") rather
than the exact computation. The approximation was chosen for speed when GPT-2
was created; exact GELU is fast enough today, but we keep the approximation so
that outputs match the pretrained weights. A sketch comparing the exact and
approximate forms appears after this list.
Projection layer (c_proj): Projects back from 3,072 to 768 dimensions
using another linear layer. This returns to the embedding dimension so outputs
can be added to residual connections.
The layer names c_fc (fully connected) and c_proj (projection) match Hugging
Face’s GPT-2 checkpoint structure. This naming is essential for loading
pretrained weights.
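For reference, the MLP parameters in block 0 of the pretrained checkpoint appear under keys like the following (the prefixes shown follow the standard Hugging Face GPT2LMHeadModel layout and may vary slightly with how the checkpoint is loaded):

transformer.h.0.mlp.c_fc.weight
transformer.h.0.mlp.c_fc.bias
transformer.h.0.mlp.c_proj.weight
transformer.h.0.mlp.c_proj.bias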
You’ll use the following MAX operations to complete this task:
Linear layers:
Linear(in_features, out_features, bias=True): Applies the linear transformation y = xW^T + b
GELU activation:
F.gelu(input, approximate="tanh"): Applies GELU activation with tanh approximation for faster computation
Implementing the MLP
You’ll create the GPT2MLP class that chains two linear layers with GELU
activation between them. The implementation is straightforward: three
operations applied in sequence.
First, import the required modules. You’ll need functional as F for the GELU
activation, Tensor for type hints, Linear for the layers, and Module as
the base class.
In the __init__ method, create two linear layers:
- Expansion layer: Linear(embed_dim, intermediate_size, bias=True), stored as self.c_fc
- Projection layer: Linear(intermediate_size, embed_dim, bias=True), stored as self.c_proj
Both layers include bias terms (bias=True). The intermediate size is typically
4× the embedding dimension.
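For GPT-2 small, the arithmetic works out as follows (a sketch; config.n_embd comes from the GPT2Config built in Step 01):

embed_dim = 768                    # config.n_embd for GPT-2 small
intermediate_size = 4 * embed_dim  # 3072, the 4x expansion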
In the forward method, apply the three transformations:
- Expand: hidden_states = self.c_fc(hidden_states)
- Activate: hidden_states = F.gelu(hidden_states, approximate="tanh")
- Project: hidden_states = self.c_proj(hidden_states)
Return the final hidden_states. The input and output shapes are the same:
[batch, seq_length, embed_dim].
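As a sanity check on those shapes, here is a NumPy sketch of the full pipeline (illustration only, not MAX code); it assumes the y = xW^T + b convention described above, so weights are shaped (out_features, in_features).

import numpy as np

batch, seq_len, embed_dim, intermediate = 2, 5, 768, 3072
rng = np.random.default_rng(0)
x = rng.standard_normal((batch, seq_len, embed_dim))

# c_fc: expand 768 -> 3072
W_fc, b_fc = rng.standard_normal((intermediate, embed_dim)), np.zeros(intermediate)
h = x @ W_fc.T + b_fc                                  # (2, 5, 3072)

# GELU, tanh approximation
h = 0.5 * h * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (h + 0.044715 * h**3)))

# c_proj: project 3072 -> 768
W_proj, b_proj = rng.standard_normal((embed_dim, intermediate)), np.zeros(embed_dim)
out = h @ W_proj.T + b_proj                            # (2, 5, 768)

assert out.shape == x.shape                            # same shape in and out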
Implementation (step_04.py):
"""
Step 04: Feed-forward Network (MLP)
Implement the MLP used in each transformer block with GELU activation.
Tasks:
1. Import functional (as F), Tensor, Linear, and Module from MAX
2. Create c_fc linear layer (embedding to intermediate dimension)
3. Create c_proj linear layer (intermediate back to embedding dimension)
4. Apply c_fc transformation in forward pass
5. Apply GELU activation function
6. Apply c_proj transformation and return result
Run: pixi run s04
"""
# 1: Import the required modules from MAX
# TODO: Import functional module from max.experimental with the alias F
# https://docs.modular.com/max/api/python/experimental/functional
# TODO: Import Tensor from max.experimental.tensor
# https://docs.modular.com/max/api/python/experimental/tensor.Tensor
# TODO: Import Linear and Module from max.nn.module_v3
# https://docs.modular.com/max/api/python/nn/module_v3
from solutions.solution_01 import GPT2Config
class GPT2MLP(Module):
    """Feed-forward network matching HuggingFace GPT-2 structure.

    Args:
        intermediate_size: Size of the intermediate layer.
        config: GPT-2 configuration.
    """

    def __init__(self, intermediate_size: int, config: GPT2Config):
        super().__init__()
        embed_dim = config.n_embd

        # 2: Create the first linear layer (embedding to intermediate)
        # TODO: Create self.c_fc as a Linear layer from embed_dim to intermediate_size with bias=True
        # https://docs.modular.com/max/api/python/nn/module_v3#max.nn.module_v3.Linear
        # Hint: This is the expansion layer in the MLP
        self.c_fc = None

        # 3: Create the second linear layer (intermediate back to embedding)
        # TODO: Create self.c_proj as a Linear layer from intermediate_size to embed_dim with bias=True
        # https://docs.modular.com/max/api/python/nn/module_v3#max.nn.module_v3.Linear
        # Hint: This is the projection layer that brings us back to the embedding dimension
        self.c_proj = None

    def __call__(self, hidden_states: Tensor) -> Tensor:
        """Apply feed-forward network.

        Args:
            hidden_states: Input hidden states.

        Returns:
            MLP output.
        """
        # 4: Apply the first linear transformation
        # TODO: Apply self.c_fc to hidden_states
        # Hint: This expands the hidden dimension to the intermediate size
        hidden_states = None

        # 5: Apply GELU activation function
        # TODO: Use F.gelu() with hidden_states and approximate="tanh"
        # https://docs.modular.com/max/api/python/experimental/functional#max.experimental.functional.gelu
        # Hint: GELU is the non-linear activation used in GPT-2's MLP
        hidden_states = None

        # 6: Apply the second linear transformation and return
        # TODO: Apply self.c_proj to hidden_states and return the result
        # Hint: This projects back to the embedding dimension
        return None
Validation
Run pixi run s04 to verify your implementation.
Show solution
"""
Solution for Step 04: Feed-forward Network (MLP)
This module implements the feed-forward network (MLP) used in each
transformer block with GELU activation.
"""
from max.experimental import functional as F
from max.experimental.tensor import Tensor
from max.nn.module_v3 import Linear, Module
from solutions.solution_01 import GPT2Config
class GPT2MLP(Module):
    """Feed-forward network matching HuggingFace GPT-2 structure.

    Args:
        intermediate_size: Size of the intermediate layer.
        config: GPT-2 configuration.
    """

    def __init__(self, intermediate_size: int, config: GPT2Config):
        super().__init__()
        embed_dim = config.n_embd
        self.c_fc = Linear(embed_dim, intermediate_size, bias=True)
        self.c_proj = Linear(intermediate_size, embed_dim, bias=True)

    def __call__(self, hidden_states: Tensor) -> Tensor:
        """Apply feed-forward network.

        Args:
            hidden_states: Input hidden states.

        Returns:
            MLP output.
        """
        hidden_states = self.c_fc(hidden_states)
        hidden_states = F.gelu(hidden_states, approximate="tanh")
        hidden_states = self.c_proj(hidden_states)
        return hidden_states
Next: In Step 05, you’ll implement token embeddings to convert discrete token IDs into continuous vector representations.