Encode and decode tokens
Learn to convert between text and token IDs using tokenizers and MAX tensors.
In this step, you’ll implement utility functions to bridge the gap between text
and the token IDs your model operates on. The encode_text() function converts
an input string into a tensor of token IDs, while decode_tokens() converts
token IDs into a string.
As you saw when building the model body in step 7 (MaxGPT2Model), the model
must receive input as token IDs (not raw text). The token IDs are integers that
represent pieces of text according to a tokenizer vocabulary. GPT-2 uses a Byte
Pair Encoding (BPE) tokenizer, which breaks text into subword units. For
example, “Hello world” becomes [15496, 995]: two tokens, one for each word.
You’ll use the Hugging Face tokenizer to handle the text-to-token conversion, then wrap it with functions that work with MAX tensors. This separation keeps tokenization (a preprocessing step) separate from model inference (tensor operations).
Understanding tokenization
Tokenization converts text to a list of integers. The GPT-2 tokenizer uses a vocabulary of 50,257 tokens, where common words get single tokens and rare words split into subwords.
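If you want to see this splitting for yourself, one quick way is the tokenizer's tokenize method, which returns the subword strings rather than their IDs. A minimal sketch (the splits noted in the comments are illustrative; the vocabulary determines the actual pieces):

    from transformers import GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    print(tokenizer.tokenize("hello"))         # a common word: a single piece
    print(tokenizer.tokenize("tokenization"))  # a rarer word: multiple pieces,
                                               # e.g. "token" + "ization"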
The Hugging Face tokenizer provides an encode method that takes text and
returns a Python list of token IDs. For example:
token_ids = tokenizer.encode("Hello world") # Returns [15496, 995]
You can specify max_length and truncation=True to limit sequence length. If
the text exceeds max_length, the tokenizer cuts it off. This prevents memory
issues with very long inputs.
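As a quick illustration of truncation (the eight-token limit here is arbitrary, chosen only to keep the sketch small):

    long_text = "a very long prompt that keeps going and going and going and going"
    token_ids = tokenizer.encode(long_text, max_length=8, truncation=True)
    assert len(token_ids) <= 8  # anything past max_length is discarded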
After encoding, you need to convert the Python list to a MAX tensor. Use
Tensor.constant to create a tensor with the token IDs, specifying
dtype=DType.int64 (GPT-2 expects 64-bit integers) and the target device.
The tensor needs shape [batch, seq_length] for model input. Wrap the token
list in another list to add the batch dimension: [token_ids] becomes
[[15496, 995]] with shape [1, 2].
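Putting the dtype, device, and batch requirements together, a minimal sketch of the conversion (assuming device is a max.driver Device created in an earlier step) looks like this:

    from max.dtype import DType
    from max.tensor import Tensor

    token_ids = [15496, 995]   # "Hello world"
    tensor = Tensor.constant(
        [token_ids],           # the extra list adds the batch dimension
        dtype=DType.int64,     # GPT-2 expects 64-bit integer IDs
        device=device,         # assumed: a Device from earlier steps
    )
    # tensor has shape [1, 2]: one batch of two tokens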
Understanding decoding
Decoding reverses tokenization: convert token IDs back to text. This requires
moving tensors from GPU to CPU, converting to NumPy, then using the tokenizer’s
decode method.
First, transfer the tensor to CPU with .to(CPU()). MAX tensors can live on GPU
or CPU, but Python libraries like NumPy only work with CPU data.
Next, convert to NumPy using np.from_dlpack. DLPack is a standard that enables
zero-copy tensor sharing between frameworks. The MAX tensor and NumPy array
share the same underlying memory.
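A minimal sketch of the transfer-and-convert step, assuming token_ids is a MAX Tensor (possibly on GPU):

    import numpy as np
    from max.driver import CPU

    cpu_tensor = token_ids.to(CPU())           # move to host memory first
    token_ids_np = np.from_dlpack(cpu_tensor)  # zero-copy view of the same buffer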
If the tensor is 2D (batch dimension present), flatten it to 1D with
.flatten(). The tokenizer expects a flat list of token IDs, not a batched
format.
Finally, convert to a Python list with .tolist() and decode with
tokenizer.decode(token_ids, skip_special_tokens=True). The
skip_special_tokens=True parameter removes padding and end-of-sequence markers
from the output.
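For example, decoding the IDs from earlier recovers the original text (the flatten call is a no-op if the array is already 1D):

    import numpy as np

    ids = np.array([[15496, 995]])   # batched shape [1, 2]
    flat = ids.flatten().tolist()    # [15496, 995]
    print(tokenizer.decode(flat, skip_special_tokens=True))  # "Hello world"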
You’ll use the following MAX operations to complete this task:
- Tensor creation: Tensor.constant(data, dtype, device) creates a tensor from Python data.
- Device transfer: tensor.to(CPU()) moves a tensor to the CPU for NumPy conversion.
- NumPy interop: np.from_dlpack(tensor) converts a MAX tensor to a NumPy array via the DLPack protocol.
Implementing tokenization
You’ll create two functions: encode_text to convert strings to tensors, and
decode_tokens to convert tensors back to strings.
First, import the required modules. You’ll need numpy as np for array
operations, CPU from MAX’s driver for device specification, DType for
specifying integer types, and Tensor for creating and manipulating tensors.
In encode_text, implement the encoding and conversion:
- Encode the text to token IDs using the tokenizer:
  token_ids = tokenizer.encode(text, max_length=max_length, truncation=True)
- Convert to a MAX tensor with batch dimension:
  Tensor.constant([token_ids], dtype=DType.int64, device=device)
Note the [token_ids] wrapping to create the batch dimension. This gives shape
[1, seq_length] instead of just [seq_length].
In decode_tokens, implement the reverse process with explicit type conversions:
- Transfer to CPU and convert to NumPy with an explicit type annotation:
  token_ids_np: np.ndarray = np.from_dlpack(token_ids.to(CPU()))
- Flatten if needed:
  if token_ids_np.ndim > 1: token_ids_np = token_ids_np.flatten()
- Convert to a Python list with an explicit type annotation:
  token_ids_list: list = token_ids_np.tolist()
- Decode to text:
  return tokenizer.decode(token_ids_list, skip_special_tokens=True)
Note the use of separate variable names (token_ids_np, token_ids_list)
instead of reusing the same variable. This makes the type conversions explicit
and improves code clarity: Tensor → np.ndarray → list → str. The
flattening step handles both 1D and 2D tensors, making the function work with
single sequences or batches.
Implementation (step_09.py):
# ===----------------------------------------------------------------------=== #
#
# This file is Modular Inc proprietary.
#
# ===----------------------------------------------------------------------=== #
"""
Step 09: Encode and decode tokens

This module provides utility functions to tokenize input text
and decode token IDs back to text using a tokenizer.

Tasks:
1. Tokenize text and convert to tensor
2. Decode token IDs back to text

Run: pixi run s09
"""

# TODO: Import required modules
# Hint: You'll need numpy as np
# Hint: You'll need CPU from max.driver
# Hint: You'll need DType from max.dtype
# Hint: You'll need Tensor from max.tensor
from max.driver import Device
from max.tensor import Tensor
from transformers import GPT2Tokenizer


def encode_text(
    text: str, tokenizer: GPT2Tokenizer, device: Device, max_length: int = 128
) -> Tensor:
    """Tokenize text and convert to tensor.

    Args:
        text: Input text to tokenize
        tokenizer: HuggingFace tokenizer
        device: Device to place tensor on
        max_length: Maximum sequence length

    Returns:
        Tensor of token IDs with shape [1, seq_length]
    """
    # TODO: Encode text to token IDs
    # Hint: token_ids = tokenizer.encode(text, max_length=max_length, truncation=True)
    pass

    # TODO: Convert to MAX tensor
    # Hint: return Tensor.constant([token_ids], dtype=DType.int64, device=device)
    # Note: Wrap tokens in a list to create batch dimension
    return None


def decode_tokens(token_ids: Tensor, tokenizer: GPT2Tokenizer) -> str:
    """Decode token IDs back to text.

    Args:
        token_ids: Tensor of token IDs
        tokenizer: HuggingFace tokenizer

    Returns:
        Decoded text string
    """
    # TODO: Convert MAX tensor to NumPy array explicitly
    # Hint: Create a new variable with type annotation: token_ids_np: np.ndarray
    # Hint: token_ids_np = np.from_dlpack(token_ids.to(CPU()))
    # Note: This makes the type conversion from Tensor to np.ndarray explicit
    pass

    # TODO: Flatten if needed
    # Hint: if token_ids_np.ndim > 1: token_ids_np = token_ids_np.flatten()
    pass

    # TODO: Convert to Python list explicitly
    # Hint: Create a new variable: token_ids_list: list = token_ids_np.tolist()
    # Note: This makes the conversion from np.ndarray to list explicit
    pass

    # TODO: Decode to text
    # Hint: return tokenizer.decode(token_ids_list, skip_special_tokens=True)
    return None
Validation
Run pixi run s09 to verify your implementation correctly converts text to
tensors and back.
Solution:
# ===----------------------------------------------------------------------=== #
#
# This file is Modular Inc proprietary.
#
# ===----------------------------------------------------------------------=== #
"""
Solution for Step 09: Encode and decode tokens

This module provides utility functions to tokenize input text
and decode token IDs back to text using a tokenizer.
"""

import numpy as np
from max.driver import CPU, Device
from max.dtype import DType
from max.tensor import Tensor
from transformers import GPT2Tokenizer


def encode_text(
    text: str, tokenizer: GPT2Tokenizer, device: Device, max_length: int = 128
) -> Tensor:
    """Tokenize text and convert to tensor."""
    # Encode to token IDs, truncating anything longer than max_length
    token_ids = tokenizer.encode(text, max_length=max_length, truncation=True)
    # Wrap in a list to add the batch dimension: shape [1, seq_length]
    return Tensor.constant([token_ids], dtype=DType.int64, device=device)


def decode_tokens(token_ids: Tensor, tokenizer: GPT2Tokenizer) -> str:
    """Decode token IDs back to text."""
    # Move to CPU, then hand NumPy a zero-copy view via DLPack
    token_ids_np: np.ndarray = np.from_dlpack(token_ids.to(CPU()))
    # Drop the batch dimension if present; the tokenizer expects a flat sequence
    if token_ids_np.ndim > 1:
        token_ids_np = token_ids_np.flatten()
    token_ids_list: list = token_ids_np.tolist()
    # skip_special_tokens strips padding and end-of-sequence markers
    return tokenizer.decode(token_ids_list, skip_special_tokens=True)
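A round trip through both functions might look like the following sketch; running on CPU() here is an assumption to keep the example self-contained, and any Device from your earlier steps should work the same way:

    from max.driver import CPU
    from transformers import GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    device = CPU()  # assumption: CPU device for a self-contained example

    tensor = encode_text("Hello world", tokenizer, device)  # shape [1, 2]
    print(decode_tokens(tensor, tokenizer))                 # "Hello world"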
Next: In Step 10, you’ll implement the text generation loop that uses these functions to produce coherent text autoregressively.