Load weights and run model
Learn to load pretrained weights from HuggingFace and prepare the model for text generation.
With all components implemented, you’re ready to load OpenAI’s pretrained GPT-2 weights and run the model. This step brings everything together: loading weights from HuggingFace, handling weight format differences, initializing the tokenizer, and compiling the model for efficient inference.
The HuggingFace transformers library provides OpenAI’s pretrained GPT-2
weights. You’ll load these weights into your MAX implementation, making your
model immediately capable of generating coherent text without training.
However, there’s a complication: HuggingFace’s GPT-2 uses Conv1D layers for its linear transformations, while your MAX implementation uses standard Linear layers. The two layer types store their weight matrices in transposed layouts relative to each other, so you’ll need to transpose specific weight matrices after loading.
Understanding weight loading
Weight loading involves three steps: loading the HuggingFace model, transferring weights to your MAX model, and transposing Conv1D weights.
First, load the pretrained model with GPT2LMHeadModel.from_pretrained("gpt2").
This downloads the weights (about 500MB) and returns a PyTorch model with the
exact architecture you’ve implemented.
Next, transfer these weights to your MAX model using
max_model.load_state_dict(hf_model.state_dict()). The state_dict is a
dictionary mapping layer names to weight tensors. Since your MAX model has the
exact same architecture and layer names, this transfer works seamlessly.
Finally, transpose the weights for layers that use Conv1D in HuggingFace:
c_attn, c_proj, and c_fc. Conv1D stores weights in shape
[in_features, out_features], while Linear expects
[out_features, in_features]. Use the .T property to transpose:
child.weight = child.weight.T.
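If you want to see the layout difference for yourself before writing any MAX code, a quick, optional inspection of the HuggingFace model makes it visible. The shapes below assume GPT-2 small, where c_attn projects 768 features to 3 × 768 = 2304:

from transformers import GPT2LMHeadModel

hf_model = GPT2LMHeadModel.from_pretrained("gpt2")

# The state_dict keys mirror the layer names in your MAX model.
print(list(hf_model.state_dict().keys())[:4])
# ['transformer.wte.weight', 'transformer.wpe.weight',
#  'transformer.h.0.ln_1.weight', 'transformer.h.0.ln_1.bias']

# Conv1D stores [in_features, out_features]; Linear expects the transpose.
w = hf_model.transformer.h[0].attn.c_attn.weight
print(w.shape)    # torch.Size([768, 2304])
print(w.T.shape)  # torch.Size([2304, 768])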
Understanding model compilation
Before you can run text generation, compile the model with
.compile(token_type). Compilation analyzes the model’s computation graph and
generates optimized code for your hardware.
First, you need to specify the token_type input using TensorType. This tells
the MAX compiler what shape and dtype to expect:
token_type = TensorType(
    DType.int64,
    ("batch", "seqlen"),
    device=DeviceRef.from_device(device)
)
The shape uses symbolic dimensions ("batch", "seqlen") rather than concrete
numbers like [1, 20]. This allows the compiled model to handle any batch size
and sequence length, not just fixed dimensions.
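For contrast, here is the flexible specification next to a fixed-shape one. The fixed variant is only an illustration, and it assumes TensorType accepts concrete integers in place of dimension names; you won’t need it in this tutorial:

from max.dtype import DType
from max.graph import DeviceRef
from max.tensor import TensorType, defaults

_, device = defaults()

# Symbolic dimensions: the compiled model accepts any batch size and length.
flexible = TensorType(
    DType.int64, ("batch", "seqlen"), device=DeviceRef.from_device(device)
)

# Fixed dimensions (illustration only, assuming integer dims are accepted):
# the compiled model would be locked to inputs of shape [1, 20].
fixed = TensorType(DType.int64, (1, 20), device=DeviceRef.from_device(device))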
Compilation takes a few seconds but only happens once. After compilation, inference is much faster because MAX has optimized the entire computation graph.
Understanding the tokenizer
Back in step 9, you implemented functions to encode and decode tokens, but both
functions require a tokenizer argument. Now you’ll load that tokenizer from
HuggingFace with GPT2Tokenizer.from_pretrained("gpt2"), which downloads the
same tokenization rules OpenAI used during training.
Set the padding token to match the end-of-sequence token:
tokenizer.pad_token = tokenizer.eos_token. GPT-2 doesn’t have a dedicated
padding token, so you reuse the EOS token for this purpose.
Then pass the tokenizer to the generate_text() function you created
in step 10 (which passes it to tokenize_text() and decode_tokens()
from step 9).
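If you’d like to sanity-check the tokenizer on its own before wiring it into generate_text(), a short standalone snippet using the standard transformers API is enough (the prompt and the token IDs shown are just an example):

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no dedicated pad token

ids = tokenizer.encode("Hello, world")
print(ids)                     # e.g. [15496, 11, 995]
print(tokenizer.decode(ids))   # "Hello, world"
print(tokenizer.eos_token_id)  # 50256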
Implementing the main function
You’ll implement the main() function that orchestrates the entire pipeline:
loading models, transferring weights, initializing the tokenizer, compiling the
model, and running an interactive prompt loop.
Start by loading the pretrained HuggingFace model:
hf_model = GPT2LMHeadModel.from_pretrained("gpt2")
Initialize your MAX model with the default device and configuration:
_, device = defaults()
config = GPT2Config()
max_model = MaxGPT2LMHeadModel(config)
The defaults() function returns (dtype, device) tuples. You only need the
device, so use _ to ignore the dtype.
Load and transpose the weights:
max_model.load_state_dict(hf_model.state_dict())
max_model.to(device)
for name, child in max_model.descendants:
    if isinstance(child, Linear):
        if any(layer_name in name for layer_name in ["c_attn", "c_proj", "c_fc"]):
            child.weight = child.weight.T
The descendants property gives you all nested modules with their full paths.
Check each child’s name for the Conv1D layer names and transpose its weights.
Initialize the tokenizer:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
Compile the model:
token_type = TensorType(
    DType.int64, ("batch", "seqlen"), device=DeviceRef.from_device(device)
)
compiled_max_model = max_model.compile(token_type)
Finally, create an interactive prompt loop where users can input text and see generated results:
try:
    while True:
        user_input = input("Enter your prompt: ").strip()
        if user_input.lower() in ["quit", "exit", "q"]:
            break
        if not user_input:
            continue
        generated_text = generate_text(
            compiled_max_model,
            tokenizer,
            device,
            user_input,
            max_new_tokens=50,
            temperature=0.8,
            do_sample=True,
        )
        print(f"\nGenerated text:\n{generated_text}\n")
except KeyboardInterrupt:
    print("\n\nExiting...")
The loop continues until the user types ‘quit’, ‘exit’, ‘q’, or presses Ctrl+C.
Implementation (step_11.py):
# ===----------------------------------------------------------------------=== #
#
# This file is Modular Inc proprietary.
#
# ===----------------------------------------------------------------------=== #
"""
Step 11: Load Weights and Run Model

Load pretrained GPT-2 weights from HuggingFace and run the complete model.

Tasks:
1. Load HuggingFace GPT-2 model and weights
2. Initialize MAX model and load state dict
3. Transpose weights for Conv1D->Linear compatibility
4. Compile model with correct input specification
5. Create interactive generation loop

Run: pixi run s11
"""

from max.dtype import DType
from max.graph import DeviceRef
from max.nn import Linear
from max.tensor import TensorType, defaults
from step_01 import GPT2Config
from step_08 import MaxGPT2LMHeadModel
from step_10 import generate_text
from transformers import GPT2LMHeadModel, GPT2Tokenizer


def run_model() -> None:
    """Load GPT-2 model, compile it, and run interactive text generation."""
    # TODO: Load HuggingFace model
    # Hint: hf_model = GPT2LMHeadModel.from_pretrained("gpt2")
    # Hint: print(f"Loaded HuggingFace model:\n{hf_model}")
    hf_model = None

    # TODO: Initialize MAX model with device
    # Hint: _, device = defaults()
    # Hint: print(f"Using device: {device}")
    # Hint: config = GPT2Config()
    # Hint: max_model = MaxGPT2LMHeadModel(config)
    device = None
    config = None
    max_model = None

    print(
        f"Model has {config.n_layer} layers, {config.n_head} heads, {config.n_embd} embedding dim"
    )

    # TODO: Load state dict and move to device
    # Hint: max_model.load_state_dict(hf_model.state_dict())
    # Hint: max_model.to(device)

    # TODO: Transpose weights for Linear layers
    # Hint: HuggingFace uses Conv1D which stores weights transposed
    # Hint: for name, child in max_model.descendants:
    #           if isinstance(child, Linear):
    #               if any(layer_name in name for layer_name in ["c_attn", "c_proj", "c_fc"]):
    #                   print(f"Transposing {name}: {child.weight.shape}")
    #                   child.weight = child.weight.T

    # TODO: Initialize tokenizer
    # Hint: tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    # Hint: tokenizer.pad_token = tokenizer.eos_token
    tokenizer = None

    # TODO: Compile model
    # Hint: print("\nCompiling model...")
    # Hint: Create TensorType with shape ("batch", "seqlen") and int64 dtype
    # Hint: token_type = TensorType(DType.int64, ("batch", "seqlen"), device=DeviceRef.from_device(device))
    # Hint: compiled_max_model = max_model.compile(token_type)
    compiled_max_model = None

    # Interactive prompt loop
    print("\n" + "=" * 50)
    print("Model ready! Enter prompts to generate text.")
    print("Press Ctrl+C or type 'quit' to exit.")
    print("=" * 50 + "\n")

    # TODO: Implement interactive generation loop
    # Hint: try:
    #           while True:
    #               user_input = input("Enter your prompt: ").strip()
    #               if user_input.lower() in ["quit", "exit", "q"]:
    #                   break
    #               if not user_input:
    #                   continue
    #               generated_text = generate_text(
    #                   compiled_max_model, tokenizer, device, user_input,
    #                   max_new_tokens=50, temperature=0.8, do_sample=True
    #               )
    #               print(f"\nGenerated text:\n{generated_text}\n")
    #       except KeyboardInterrupt:
    #           print("\n\nExiting...")


if __name__ == "__main__":
    run_model()
Validation
Run pixi run s11 to verify your implementation.
Show solution
# ===----------------------------------------------------------------------=== #
#
# This file is Modular Inc proprietary.
#
# ===----------------------------------------------------------------------=== #
"""
Solution for Step 11: Load weights and run model
"""

from max.dtype import DType
from max.graph import DeviceRef
from max.nn import Linear
from max.tensor import TensorType, defaults
from step_01 import GPT2Config
from step_08 import MaxGPT2LMHeadModel
from step_10 import generate_text
from transformers import GPT2LMHeadModel, GPT2Tokenizer


def run_model() -> None:
    # Load HuggingFace model
    hf_model = GPT2LMHeadModel.from_pretrained("gpt2")
    print(f"Loaded HuggingFace model:\n{hf_model}")

    # Initialize MAX model
    _, device = defaults()
    print(f"Using device: {device}")
    config = GPT2Config()
    max_model = MaxGPT2LMHeadModel(config)
    print(
        f"Model has {config.n_layer} layers, {config.n_head} heads, {config.n_embd} embedding dim"
    )

    # Load state dict and transpose weights
    max_model.load_state_dict(hf_model.state_dict())
    max_model.to(device)
    for name, child in max_model.descendants:
        if isinstance(child, Linear):
            if any(layer_name in name for layer_name in ["c_attn", "c_proj", "c_fc"]):
                print(f"Transposing {name}: {child.weight.shape}")
                # The upstream model has Conv1D layers instead of Linear, which
                # store their weights transposed compared to Linear
                child.weight = child.weight.T

    # Initialize tokenizer
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token  # Set padding token

    # Compile model
    print("\nCompiling model...")
    token_type = TensorType(
        DType.int64, ("batch", "seqlen"), device=DeviceRef.from_device(device)
    )
    compiled_max_model = max_model.compile(token_type)

    # Interactive prompt loop
    print("\n" + "=" * 50)
    print("Model ready! Enter prompts to generate text.")
    print("Press Ctrl+C or type 'quit' to exit.")
    print("=" * 50 + "\n")

    try:
        while True:
            user_input = input("Enter your prompt: ").strip()
            if user_input.lower() in ["quit", "exit", "q"]:
                print("Exiting...")
                break
            if not user_input:
                print("Please enter a non-empty prompt.\n")
                continue
            print()
            generated_text = generate_text(
                compiled_max_model,
                tokenizer,
                device,
                user_input,
                max_new_tokens=50,
                temperature=0.8,
                do_sample=True,
            )
            print(f"\nGenerated text:\n{generated_text}\n")
            print("-" * 50 + "\n")
    except KeyboardInterrupt:
        print("\n\nExiting...")


if __name__ == "__main__":
    run_model()
Congratulations! You’ve built a complete GPT-2 implementation from scratch.
If code verification passed, you can execute your step_11.py code with
pixi run gpt2.
What’s next?
You now understand the architectural foundation that powers modern language models. LLaMA, Mistral, and more build on these same components with incremental refinements. You have everything you need to implement those refinements yourself.
Consider extending your implementation with:
- Grouped-query attention (GQA): Reduce memory consumption by sharing key-value pairs across multiple query heads, as used in LLaMA 2.
- Rotary position embeddings (RoPE): Replace learned position embeddings with rotation-based encoding, improving length extrapolation in models like LLaMA and GPT-NeoX.
- SwiGLU activation: Swap GELU for the gated linear unit variant used in LLaMA and PaLM.
- Mixture of experts (MoE): Add sparse expert routing to scale model capacity efficiently, as in Mixtral and GPT-4.
Each refinement builds directly on what you’ve implemented. The attention mechanism you wrote becomes grouped-query attention with a simple modification to how you reshape key-value tensors. Your position embeddings can be replaced with RoPE by changing how you encode positional information. The feed-forward network you built becomes SwiGLU by adding a gating mechanism.
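As a taste of how small these changes can be, here is a framework-agnostic NumPy sketch of a SwiGLU feed-forward block. The weight shapes are illustrative and the helper names are made up for this example; it is a sketch of the technique, not code from this tutorial:

import numpy as np

def silu(x: np.ndarray) -> np.ndarray:
    # SiLU / Swish activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    # SwiGLU feed-forward: the gate path is activated with SiLU and multiplied
    # elementwise with the "up" projection, then projected back down.
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

# Illustrative shapes: hidden size 768, inner size 2048 (both arbitrary here).
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 768))          # [seq, hidden]
w_gate = rng.standard_normal((768, 2048))  # gate projection
w_up = rng.standard_normal((768, 2048))    # up projection
w_down = rng.standard_normal((2048, 768))  # down projection
print(swiglu_ffn(x, w_gate, w_up, w_down).shape)  # (4, 768)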
Pick an architecture that interests you and start building. You’ll find the patterns are familiar because the fundamentals haven’t changed.