Architecture registration

How arch.py and __init__.py plug the serving package into max serve, and what each field in SupportedArchitecture controls.

When you run max serve --custom-architectures gpt2_arch, max serve imports the package, reads its ARCHITECTURES list, and adds GPT-2 to its model registry.

The package entry point

__init__.py states the contract:

__init__.py
from .arch import gpt2_arch

ARCHITECTURES = [gpt2_arch]

__all__ = ["ARCHITECTURES", "gpt2_arch"]

The architecture declaration

arch.py assembles the SupportedArchitecture that MAX registers. Each field tells the serving layer something it needs before a request arrives:

arch.py
from __future__ import annotations

from max.graph.weights import WeightsFormat
from max.interfaces import PipelineTask
from max.pipelines.core import TextContext
from max.pipelines.lib import SupportedArchitecture, TextTokenizer

from . import weight_adapters
from .model import GPT2PipelineModel
from .model_config import GPT2ArchConfig

gpt2_arch = SupportedArchitecture(
    # Must match the HuggingFace config "architectures" field
    name="GPT2LMHeadModel",
    task=PipelineTask.TEXT_GENERATION,
    example_repo_ids=["gpt2", "openai-community/gpt2"],
    default_weights_format=WeightsFormat.safetensors,
    default_encoding="float32",
    supported_encodings={"float32"},
    pipeline_model=GPT2PipelineModel,
    tokenizer=TextTokenizer,
    context_type=TextContext,
    multi_gpu_supported=False,
    rope_type="none",
    weight_adapters={
        WeightsFormat.safetensors: weight_adapters.convert_safetensor_state_dict,
    },
    config=GPT2ArchConfig,
    required_arguments={"enable_prefix_caching": False},
)

name: must match the "architectures" field in Hugging Face’s config.json exactly. When you run max serve --model gpt2, MAX downloads the model, reads config.json, and looks up that name in its registry. A mismatch means MAX never matches the model to this architecture, so the registration is silently useless.
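The lookup behaves like a dictionary keyed on that name. A minimal sketch of the matching rule, assuming a toy registry (the registry mechanics here are invented for illustration; only the exact-match against config.json's "architectures" entry reflects what MAX does):

```python
# Toy registry sketch: maps the class name declared in config.json to a
# registered architecture. MAX's internal registry is more involved;
# only the exact-equality matching rule is the point.
REGISTRY = {"GPT2LMHeadModel": "gpt2_arch"}

def resolve(hf_config: dict) -> str:
    # config.json lists one or more architecture class names
    for name in hf_config.get("architectures", []):
        if name in REGISTRY:
            return REGISTRY[name]
    raise ValueError(f"no registered architecture for {hf_config.get('architectures')}")

# gpt2's config.json declares "architectures": ["GPT2LMHeadModel"]
print(resolve({"architectures": ["GPT2LMHeadModel"]}))  # -> gpt2_arch
```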

weight_adapters: maps each WeightsFormat to a conversion function. When MAX loads the safetensors checkpoint, it calls weight_adapters.convert_safetensor_state_dict to produce the layout MaxGPT2LMHeadModel expects.
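The real convert_safetensor_state_dict lives in weight_adapters.py; the sketch below only illustrates the two jobs such an adapter typically does, renaming keys and fixing layout. One GPT-2-specific wrinkle is real: the Hugging Face checkpoint stores linear weights in Conv1D layout (in_features × out_features), so a Linear-style consumer transposes them. The key names and rename rule here are simplified assumptions.

```python
# Illustrative state-dict adapter, not MAX's actual converter.
# Job 1: rename checkpoint keys into the target model's namespace.
# Job 2: transpose Conv1D-layout weights into Linear layout.

def transpose(matrix):
    return [list(row) for row in zip(*matrix)]

def convert_state_dict(hf_state_dict: dict) -> dict:
    converted = {}
    for key, value in hf_state_dict.items():
        new_key = key.removeprefix("transformer.")  # rename into target namespace
        if key.endswith(("c_attn.weight", "c_proj.weight", "c_fc.weight")):
            value = transpose(value)                # Conv1D -> Linear layout
        converted[new_key] = value
    return converted

sd = {"transformer.h.0.mlp.c_fc.weight": [[1, 2, 3], [4, 5, 6]]}
print(convert_state_dict(sd))
# {'h.0.mlp.c_fc.weight': [[1, 4], [2, 5], [3, 6]]}
```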

tokenizer: is TextTokenizer, which wraps the Hugging Face tokenizer for the model. Before any token is processed, max serve calls it to convert the prompt to token IDs and, after generation, to decode the output IDs back to text.
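The round trip is easiest to see with a toy stand-in (the two-word vocabulary and functions below are invented for illustration; the real TextTokenizer delegates to the model's Hugging Face tokenizer):

```python
# Toy tokenizer sketch of the encode/decode round trip max serve
# performs around the model. Vocabulary is made up for illustration.
VOCAB = {"hello": 0, "world": 1}
INVERSE = {i: tok for tok, i in VOCAB.items()}

def encode(prompt: str) -> list[int]:
    # prompt -> token IDs, before the model runs
    return [VOCAB[token] for token in prompt.split()]

def decode(ids: list[int]) -> str:
    # generated IDs -> text, after the model runs
    return " ".join(INVERSE[i] for i in ids)

assert decode(encode("hello world")) == "hello world"
```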

config: points to GPT2ArchConfig, which provides the KV cache dimensions covered in KV cache configuration.

required_arguments: is a hard constraint on the serving layer. Here, enable_prefix_caching: False prevents max serve from enabling prefix caching for this model. This GPT-2 implementation passes the full token sequence on every decode step rather than reading from an incremental KV cache, so prefix caching has nothing to reuse and doesn’t apply.
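A sketch of the decode loop this constraint implies, with run_model standing in for real graph execution (the function and its arithmetic are invented; only the shape of the loop matters):

```python
# Full-sequence decode, as this GPT-2 port does it: every step feeds
# the ENTIRE token sequence back through the model. There is no cached
# KV state between steps, so prefix caching has nothing to attach to.

def run_model(tokens: list[int]) -> int:
    # stand-in for graph execution; a real model predicts the next token
    return (sum(tokens) + len(tokens)) % 50257

def generate(prompt_ids: list[int], steps: int) -> list[int]:
    tokens = list(prompt_ids)
    for _ in range(steps):
        next_id = run_model(tokens)  # full sequence, every step
        tokens.append(next_id)
    return tokens
```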

What you’ve built

You’ve built two complete layers of an LLM serving system and wired them together.

The first layer is the model: everything from token embeddings through the language model head, compiled to a MAX graph. The second layer is the serving infrastructure: a weight adapter that maps Hugging Face checkpoints to MAX’s layout, a config class that tells the serving layer how much KV cache to allocate, and a pipeline model that loads, compiles, and executes the graph on demand.

Any max.experimental.nn.Module follows the same pattern to get from model weights to a live endpoint:

  1. Implement the model with max.experimental.nn
  2. Adapt the weights with a WeightsFormat converter
  3. Expose cache dimensions with an ArchConfigWithAttentionKVCache subclass
  4. Wrap execution in a PipelineModelWithKVCache subclass
  5. Register the package as a SupportedArchitecture and pass --custom-architectures to max serve

Modern LLMs build on these same components with targeted refinements:

  • Grouped-query attention (GQA): share key-value pairs across multiple query heads to reduce memory, as in LLaMA.
  • Rotary position embeddings (RoPE): replace learned position embeddings with rotation-based encoding for better length generalization.
  • SwiGLU activation: swap GELU for the gated linear unit variant used in LLaMA and Mistral.
  • Incremental KV cache: cache key and value tensors across decode steps so each step processes only the new token instead of the full sequence.
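The last refinement can be sketched in miniature. The scalar "projections" below are arbitrary stand-ins, not real attention math; the point is that each decode step projects only the new token and appends to a cache, instead of reprocessing the whole sequence:

```python
# Incremental KV cache sketch, contrasting with the full-sequence loop
# this GPT-2 port uses. Keys/values here are toy scalars for clarity.

class KVCache:
    def __init__(self):
        self.keys: list[float] = []
        self.values: list[float] = []

    def append(self, k: float, v: float):
        self.keys.append(k)
        self.values.append(v)

def decode_step(token: int, cache: KVCache) -> int:
    # Project only the NEW token and extend the cache; attention then
    # reads the whole cache without recomputing past projections.
    k, v = float(token), float(token) * 0.5  # stand-in projections
    cache.append(k, v)
    attn = sum(ki * vi for ki, vi in zip(cache.keys, cache.values))
    return int(attn) % 50257
```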

Each builds directly on what you’ve read here.

Run the model to see it in action.