Build an LLM from scratch in MAX

Experimental APIs: The APIs in the experimental package are subject to change. Share feedback on the MAX LLMs forum.

Transformer models power today’s most impactful AI applications, from language models like ChatGPT to code generation tools like GitHub Copilot. Maybe you’ve been asked to adapt one of these models for your team, or you want to understand what’s actually happening when you call an inference API. Either way, building a transformer from scratch is one of the best ways to truly understand how they work.

This guide walks you through implementing GPT-2 with the MAX framework's experimental API. You'll build each component yourself (embeddings, attention mechanisms, and feed-forward layers) and see how they fit together into a complete language model as you work through the sequential coding challenges in the tutorial's GitHub repository.

This API is unstable: This tutorial is built on the MAX Experimental API, which we expect to change and expand with new features and functionality over time. As the API evolves, we plan to update the tutorial accordingly; when the API comes out of experimental development, the tutorial content will stabilize along with it. Until then, the tutorial is pinned to a major release version.

Why GPT-2?

GPT-2 is the architectural foundation for modern language models. LLaMA, Mistral, and GPT-4 are all built on the same core components you'll implement here:

  • multi-head attention
  • feed-forward layers
  • layer normalization
  • residual connections

Modern variants add refinements like grouped-query attention or mixture of experts, but the fundamentals remain the same. GPT-2 is complex enough to teach real transformer architecture but simple enough to implement completely and understand deeply. When you grasp how its pieces fit together, you understand how to build any transformer-based model.
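
To make those four pieces concrete before you build them in MAX, here's a minimal NumPy sketch (not MAX code) of how they compose inside a single pre-norm transformer block. The shapes are toy-sized, the attention has one head and no learned projections, and the layer norm omits its learned scale and shift; everything here is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 4, 8                      # tokens, model width (toy sizes)
W1, W2 = rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d))

def layer_norm(x, eps=1e-5):
    # Normalize each token's features to zero mean, unit variance
    # (learned scale/shift omitted for brevity).
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def gelu(x):
    # GPT-2's tanh approximation of GELU.
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def attention(x, mask):
    # One head and no Q/K/V projections, to keep the sketch short;
    # GPT-2 projects x into queries, keys, and values and runs 12 heads.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = np.where(mask, scores, -1e9)    # causal mask hides future tokens
    return softmax(scores) @ x

def block(x, mask):
    # Residual connections wrap both sublayers.
    x = x + attention(layer_norm(x), mask)   # attention sublayer
    x = x + gelu(layer_norm(x) @ W1) @ W2    # feed-forward (MLP) sublayer
    return x

mask = np.tril(np.ones((T, T), dtype=bool))  # lower-triangular causal mask
out = block(rng.normal(size=(T, d)), mask)
print(out.shape)                             # (T, d)
```

GPT-2 adds the learned projections, 12 parallel heads, and learnable layer-norm parameters, then stacks 12 of these blocks; assembling exactly that is what the steps below walk you through.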

Learning by building: This tutorial follows a format popularized by Andrej Karpathy’s educational work and Sebastian Raschka’s hands-on approach. Rather than abstract theory, you’ll implement each component yourself, building intuition through practice.

Why MAX?

Traditional ML development often feels like stitching together tools that weren’t designed to work together. Maybe you write your model in PyTorch, optimize in CUDA, convert to ONNX for deployment, then use separate serving tools. Each handoff introduces complexity.

MAX Framework takes a different approach: everything happens in one unified system. You define your model, load weights, and run inference, all in MAX's Python API. The MAX Platform handles optimization automatically, and you can even use MAX Serve to manage your deployment.

When you build GPT-2 in this guide, you’ll load pretrained weights from Hugging Face, implement the architecture, and run text generation, all in the same environment.
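
As a taste of the first of those steps, pretrained GPT-2 weights can be pulled straight from the Hub. This sketch uses only huggingface_hub and safetensors (not MAX's own loading utilities) and assumes the gpt2 repo ships a model.safetensors checkpoint, which it currently does:

```python
from huggingface_hub import hf_hub_download
from safetensors.numpy import load_file

# Download the GPT-2 checkpoint from the Hugging Face Hub
# (assumes a safetensors file is published in the "gpt2" repo).
path = hf_hub_download(repo_id="gpt2", filename="model.safetensors")

# Load the raw tensors as a dict of NumPy arrays, keyed by parameter name.
weights = load_file(path)
print(len(weights), "tensors")
print(sorted(weights)[:5])  # a few parameter names from the checkpoint
```

In the tutorial itself, you'll map these raw tensors onto the layers you implement, so the model you build runs with OpenAI's original weights.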

Why coding challenges?

This tutorial emphasizes active problem-solving over passive reading. Each step presents a focused implementation task with:

  1. Clear context: What you’re building and why it matters
  2. Guided implementation: Code structure with specific tasks to complete
  3. Immediate validation: Tests that verify correctness before moving forward
  4. Conceptual grounding: Explanations that connect code to architecture

Rather than presenting complete solutions, this approach helps you develop intuition for when and why to use specific patterns. The skills you build extend beyond GPT-2 to model development more broadly.

You can work through the tutorial sequentially for comprehensive understanding, or skip directly to topics you need. Each step is self-contained enough to be useful independently while building toward a complete implementation.

What you’ll build

This tutorial guides you through building GPT-2 in manageable steps:

| Step | Component | What you'll learn |
| --- | --- | --- |
| 1 | Model configuration | Define architecture hyperparameters matching Hugging Face GPT-2. |
| 2 | Causal masking | Create attention masks to prevent looking at future tokens. |
| 3 | Layer normalization | Stabilize activations for effective training. |
| 4 | GPT-2 MLP (feed-forward network) | Build the position-wise feed-forward network with GELU activation. |
| 5 | Token embeddings | Convert token IDs to continuous vector representations. |
| 6 | Position embeddings | Encode sequence order information. |
| 7 | Multi-head attention | Extend to multiple parallel attention heads. |
| 8 | Residual connections & layer norm | Enable training deep networks with skip connections. |
| 9 | Transformer block | Combine attention and MLP into the core building block. |
| 10 | Stacking transformer blocks | Create the complete 12-layer GPT-2 model. |
| 11 | Language model head | Project hidden states to vocabulary logits. |
| 12 | Text generation | Generate text autoregressively with temperature sampling (sketched below). |
| 00 | Serve your model | Run GPT-2 as an endpoint with MAX Serve (coming soon). |
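
As a preview of step 12, here's a minimal sketch of autoregressive generation with temperature sampling. The `model` function is a stub standing in for the GPT-2 you'll have built by then, and the starting token is just GPT-2's `<|endoftext|>` ID; none of this is MAX API code.

```python
import numpy as np

def sample_next(logits, temperature=0.8, rng=np.random.default_rng(0)):
    # Temperature scales logits before softmax: <1 sharpens the
    # distribution (more greedy), >1 flattens it (more random).
    logits = logits / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

def model(token_ids):
    # Stub: returns random logits over GPT-2's 50,257-token vocabulary.
    # Your finished model returns real next-token logits here.
    return np.random.default_rng(len(token_ids)).normal(size=50257)

tokens = [50256]                 # GPT-2's <|endoftext|> start token
for _ in range(10):
    # Autoregressive loop: feed the growing sequence back into the model.
    tokens.append(sample_next(model(tokens)))
print(tokens)
```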

By the end, you’ll have a complete GPT-2 implementation and practical experience with MAX’s Python API. These are skills you can immediately apply to your own projects.

To install the puzzles, follow the steps in Setup.