Architecture registration
How arch.py and __init__.py plug the serving package into max serve, and what each field in SupportedArchitecture controls.
When you run max serve --custom-architectures gpt2_arch, max serve imports the package, reads its ARCHITECTURES list, and adds GPT-2 to its model registry.
The package entry point
__init__.py states the contract:
```python
from .arch import gpt2_arch

ARCHITECTURES = [gpt2_arch]
__all__ = ["ARCHITECTURES", "gpt2_arch"]
```
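Conceptually, the discovery step is just an import plus an attribute read. The sketch below shows the general shape; `load_architectures` is an illustrative name, not MAX's actual internal API.

```python
# Hypothetical sketch of how a serving registry could discover the entry
# point: import the package named on the command line and collect whatever
# its ARCHITECTURES list contains. Not MAX's real internals.
import importlib


def load_architectures(module_name: str) -> list:
    """Import a package and return its ARCHITECTURES list (empty if absent)."""
    module = importlib.import_module(module_name)
    return list(getattr(module, "ARCHITECTURES", []))
```

A package that exports no `ARCHITECTURES` simply contributes nothing to the registry.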
The architecture declaration
arch.py assembles the
SupportedArchitecture
that MAX registers. Each field tells the serving layer something it needs before
a request arrives:
```python
from __future__ import annotations

from max.graph.weights import WeightsFormat
from max.interfaces import PipelineTask
from max.pipelines.core import TextContext
from max.pipelines.lib import SupportedArchitecture, TextTokenizer

from . import weight_adapters
from .model import GPT2PipelineModel
from .model_config import GPT2ArchConfig

gpt2_arch = SupportedArchitecture(
    # Must match the Hugging Face config "architectures" field
    name="GPT2LMHeadModel",
    task=PipelineTask.TEXT_GENERATION,
    example_repo_ids=["gpt2", "openai-community/gpt2"],
    default_weights_format=WeightsFormat.safetensors,
    default_encoding="float32",
    supported_encodings={"float32"},
    pipeline_model=GPT2PipelineModel,
    tokenizer=TextTokenizer,
    context_type=TextContext,
    multi_gpu_supported=False,
    rope_type="none",
    weight_adapters={
        WeightsFormat.safetensors: weight_adapters.convert_safetensor_state_dict,
    },
    config=GPT2ArchConfig,
    required_arguments={"enable_prefix_caching": False},
)
```
name: must match the "architectures" field in Hugging Face’s
config.json exactly. When you run max serve --model gpt2, MAX downloads
the model, reads config.json, and looks up that name in its registry. A
mismatch means the package never loads.
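The lookup described above amounts to a dictionary hit keyed on that name. The sketch below is illustrative; the registry dict stands in for MAX's internal structure.

```python
import json

# Illustrative registry keyed the way the text describes: by the model class
# name taken from config.json's "architectures" list.
registry = {"GPT2LMHeadModel": "gpt2_arch entry"}

# Minimal stand-in for the config.json MAX downloads alongside the weights.
config = json.loads('{"architectures": ["GPT2LMHeadModel"], "model_type": "gpt2"}')

# The first "architectures" entry is the lookup key; a miss means the
# custom architecture is never used.
arch = registry.get(config["architectures"][0])
```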
weight_adapters: maps each
WeightsFormat
to a conversion function. When MAX loads the safetensors checkpoint, it calls
weight_adapters.convert_safetensor_state_dict to produce the layout
MaxGPT2LMHeadModel expects.
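At its core, a weight adapter is a state-dict transform. The sketch below shows the general shape with a hypothetical prefix-stripping rule; GPT-2's real adapter also adjusts tensor layouts, not just key names.

```python
# Sketch of a safetensors weight adapter's job: map checkpoint key names to
# the names the MAX model expects. The prefix rule here is hypothetical.
def convert_state_dict(state_dict: dict) -> dict:
    converted = {}
    for key, tensor in state_dict.items():
        # e.g. "transformer.wte.weight" -> "wte.weight" (illustrative)
        converted[key.removeprefix("transformer.")] = tensor
    return converted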
tokenizer: TextTokenizer, which wraps the Hugging Face tokenizer for the model. Before any token is processed, max serve calls it to convert the prompt to token IDs and, after generation, to decode the output IDs back to text.
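The encode/decode contract can be illustrated with a toy tokenizer; the whitespace vocabulary below is purely illustrative and has nothing to do with GPT-2's real BPE vocabulary.

```python
# Toy tokenizer showing the contract TextTokenizer fulfills: encode the
# prompt to IDs before generation, decode output IDs back to text after.
class ToyTokenizer:
    def __init__(self, vocab: dict[str, int]):
        self.vocab = vocab
        self.inverse = {i: token for token, i in vocab.items()}

    def encode(self, text: str) -> list[int]:
        return [self.vocab[token] for token in text.split()]

    def decode(self, ids: list[int]) -> str:
        return " ".join(self.inverse[i] for i in ids)
```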
config: points to GPT2ArchConfig, which provides the KV cache
dimensions covered in KV cache configuration.
required_arguments: a hard constraint on the serving layer: enable_prefix_caching: False prevents max serve from enabling prefix caching for this model. GPT-2 passes the full token sequence on every decode step rather than using an incremental KV cache, so prefix caching doesn’t apply.
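One way a serving layer might enforce such a constraint at startup is sketched below; `check_required_arguments` is an illustrative name, not MAX's actual validation code.

```python
# Sketch: reject a serving configuration that conflicts with an
# architecture's required_arguments before accepting any traffic.
def check_required_arguments(required: dict, user_args: dict) -> None:
    for name, expected in required.items():
        actual = user_args.get(name, expected)  # unset args take the required value
        if actual != expected:
            raise ValueError(
                f"{name}={actual!r} conflicts with this architecture's "
                f"required value {expected!r}"
            )
```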
What you’ve built
You’ve built two complete layers of an LLM serving system and wired them together.
The first layer is the model: everything from token embeddings through the language model head, compiled to a MAX graph. The second layer is the serving infrastructure: a weight adapter that maps Hugging Face checkpoints to MAX’s layout, a config class that tells the serving layer how much KV cache to allocate, and a pipeline model that loads, compiles, and executes the graph on demand.
Any max.experimental.nn.Module follows the same pattern to get from model weights to a live endpoint:
- Implement the model with max.experimental.nn
- Adapt the weights with a WeightsFormat converter
- Expose cache dimensions with an ArchConfigWithAttentionKVCache subclass
- Wrap execution in a PipelineModelWithKVCache subclass
- Register the package as a SupportedArchitecture and pass --custom-architectures to max serve
Modern LLMs build on these same components with targeted refinements:
- Grouped-query attention (GQA): share key-value pairs across multiple query heads to reduce memory, as in LLaMA.
- Rotary position embeddings (RoPE): replace learned position embeddings with rotation-based encoding for better length generalization.
- SwiGLU activation: swap GELU for the gated linear unit variant used in LLaMA and Mistral.
- Incremental KV cache: cache key and value tensors across decode steps so each step processes only the new token instead of the full sequence.
Each builds directly on what you’ve read here.
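As a taste of the first refinement, GQA's head sharing reduces to a small index calculation. This sketch is generic, not tied to any particular implementation:

```python
# In grouped-query attention, n_q query heads share n_kv key-value heads:
# consecutive groups of n_q // n_kv query heads read the same KV head.
# Standard multi-head attention is the special case n_q == n_kv.
def kv_head_for_query(q_head: int, n_q: int, n_kv: int) -> int:
    assert n_q % n_kv == 0, "query heads must divide evenly across KV heads"
    return q_head // (n_q // n_kv)
```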
Run the model to see it in action.