Architecture registration
How arch.py and __init__.py plug the serving package into max serve, and what each field in SupportedArchitecture controls.
When you run max serve --custom-architectures gpt2_arch, max serve imports the package, reads its ARCHITECTURES list, and adds GPT-2 to its model registry.
The package entry point
__init__.py states the contract:
```python
from .arch import gpt2_arch

ARCHITECTURES = [gpt2_arch]
__all__ = ["ARCHITECTURES", "gpt2_arch"]
```
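Conceptually, the discovery step is just an import plus an attribute read. The sketch below shows the general shape; `load_architectures` is an illustrative name, not MAX's actual internal API.

```python
# Hypothetical sketch of how a serving registry could discover the entry
# point: import the package named on the command line and collect whatever
# its ARCHITECTURES list contains. Not MAX's real internals.
import importlib


def load_architectures(module_name: str) -> list:
    """Import a package and return its ARCHITECTURES list (empty if absent)."""
    module = importlib.import_module(module_name)
    return list(getattr(module, "ARCHITECTURES", []))
```

A package that exports no `ARCHITECTURES` simply contributes nothing to the registry.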
The architecture declaration
arch.py assembles the
SupportedArchitecture
that MAX registers. Each field tells the serving layer something it needs before
a request arrives:
```python
from __future__ import annotations

from max.graph.weights import WeightsFormat
from max.interfaces import PipelineTask
from max.pipelines.core import TextContext
from max.pipelines.lib import SupportedArchitecture, TextTokenizer

from . import weight_adapters
from .model import GPT2PipelineModel
from .model_config import GPT2ArchConfig

gpt2_arch = SupportedArchitecture(
    # Must match the Hugging Face config "architectures" field
    name="GPT2LMHeadModel",
    task=PipelineTask.TEXT_GENERATION,
    example_repo_ids=["gpt2", "openai-community/gpt2"],
    default_weights_format=WeightsFormat.safetensors,
    default_encoding="float32",
    supported_encodings={"float32"},
    pipeline_model=GPT2PipelineModel,
    tokenizer=TextTokenizer,
    context_type=TextContext,
    multi_gpu_supported=False,
    rope_type="none",
    weight_adapters={
        WeightsFormat.safetensors: weight_adapters.convert_safetensor_state_dict,
    },
    config=GPT2ArchConfig,
    required_arguments={"enable_prefix_caching": False},
)
```
name: must match the "architectures" field in Hugging Face’s
config.json exactly. When you run max serve --model gpt2, MAX downloads
the model, reads config.json, and looks up that name in its registry. A
mismatch means the package never loads.
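The lookup described above amounts to a dictionary hit keyed on that name. The sketch below is illustrative; the registry dict stands in for MAX's internal structure.

```python
import json

# Illustrative registry keyed the way the text describes: by the model class
# name taken from config.json's "architectures" list.
registry = {"GPT2LMHeadModel": "gpt2_arch entry"}

# Minimal stand-in for the config.json MAX downloads alongside the weights.
config = json.loads('{"architectures": ["GPT2LMHeadModel"], "model_type": "gpt2"}')

# The first "architectures" entry is the lookup key; a miss means the
# custom architecture is never used.
arch = registry.get(config["architectures"][0])
```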
weight_adapters: maps each
WeightsFormat
to a conversion function. When MAX loads the safetensors checkpoint, it calls
weight_adapters.convert_safetensor_state_dict to produce the layout
MaxGPT2LMHeadModel expects.
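At its core, a weight adapter is a state-dict transform. The sketch below shows the general shape with a hypothetical prefix-stripping rule; GPT-2's real adapter also adjusts tensor layouts, not just key names.

```python
# Sketch of a safetensors weight adapter's job: map checkpoint key names to
# the names the MAX model expects. The prefix rule here is hypothetical.
def convert_state_dict(state_dict: dict) -> dict:
    converted = {}
    for key, tensor in state_dict.items():
        # e.g. "transformer.wte.weight" -> "wte.weight" (illustrative)
        converted[key.removeprefix("transformer.")] = tensor
    return converted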
tokenizer: TextTokenizer, which wraps the Hugging Face tokenizer for the model. Before any token is processed, max serve calls it to convert the prompt to token IDs and, after generation, to decode the output IDs back to text.
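The encode/decode contract can be illustrated with a toy tokenizer; the whitespace vocabulary below is purely illustrative and has nothing to do with GPT-2's real BPE vocabulary.

```python
# Toy tokenizer showing the contract TextTokenizer fulfills: encode the
# prompt to IDs before generation, decode output IDs back to text after.
class ToyTokenizer:
    def __init__(self, vocab: dict[str, int]):
        self.vocab = vocab
        self.inverse = {i: token for token, i in vocab.items()}

    def encode(self, text: str) -> list[int]:
        return [self.vocab[token] for token in text.split()]

    def decode(self, ids: list[int]) -> str:
        return " ".join(self.inverse[i] for i in ids)
```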
config: points to GPT2ArchConfig, which provides the KV cache
dimensions covered in KV cache configuration.
required_arguments: a hard constraint on the serving layer: enable_prefix_caching: False prevents max serve from enabling prefix caching for this model. GPT-2 passes the full token sequence on every decode step rather than using an incremental KV cache, so prefix caching doesn’t apply.
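One way a serving layer might enforce such a constraint at startup is sketched below; `check_required_arguments` is an illustrative name, not MAX's actual validation code.

```python
# Sketch: reject a serving configuration that conflicts with an
# architecture's required_arguments before accepting any traffic.
def check_required_arguments(required: dict, user_args: dict) -> None:
    for name, expected in required.items():
        actual = user_args.get(name, expected)  # unset args take the required value
        if actual != expected:
            raise ValueError(
                f"{name}={actual!r} conflicts with this architecture's "
                f"required value {expected!r}"
            )
```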
What you’ve built
You’ve built two complete layers of an LLM serving system and wired them together.
The first layer is the model: everything from token embeddings through the language model head, compiled to a MAX graph. The second layer is the serving infrastructure: a weight adapter that maps Hugging Face checkpoints to MAX’s layout, a config class that tells the serving layer how much KV cache to allocate, and a pipeline model that loads, compiles, and executes the graph on demand.
Any max.experimental.nn.Module follows the same pattern to get from model weights to a live endpoint:
- Implement the model with max.experimental.nn
- Adapt the weights with a WeightsFormat converter
- Expose cache dimensions with an ArchConfigWithAttentionKVCache subclass
- Wrap execution in a PipelineModelWithKVCache subclass
- Register the package as a SupportedArchitecture and pass --custom-architectures to max serve
Modern LLMs build on these same components with targeted refinements:
- Grouped-query attention (GQA): share key-value pairs across multiple query heads to reduce memory, as in LLaMA.
- Rotary position embeddings (RoPE): replace learned position embeddings with rotation-based encoding for better length generalization.
- SwiGLU activation: swap GELU for the gated linear unit variant used in LLaMA and Mistral.
- Incremental KV cache: cache key and value tensors across decode steps so each step processes only the new token instead of the full sequence.
Each builds directly on what you’ve read here.
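As a taste of the first refinement, GQA's head sharing reduces to a small index calculation. This sketch is generic, not tied to any particular implementation:

```python
# In grouped-query attention, n_q query heads share n_kv key-value heads:
# consecutive groups of n_q // n_kv query heads read the same KV head.
# Standard multi-head attention is the special case n_q == n_kv.
def kv_head_for_query(q_head: int, n_q: int, n_kv: int) -> int:
    assert n_q % n_kv == 0, "query heads must divide evenly across KV heads"
    return q_head // (n_q // n_kv)
```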
Run the model to see it in action.