Encode and decode tokens

Learn to convert between text and token IDs using tokenizers and MAX tensors.

In this step, you’ll implement utility functions that bridge the gap between text and the token IDs your model operates on. The encode_text() function converts an input string into a tensor of token IDs, and decode_tokens() converts a tensor of token IDs back into a string.

As you saw when building the model body in step 7 (MaxGPT2Model), the model must receive input as token IDs, not raw text. Token IDs are integers that represent pieces of text according to the tokenizer's vocabulary. GPT-2 uses a Byte Pair Encoding (BPE) tokenizer, which breaks text into subword units. For example, “Hello world” becomes [15496, 995]: two tokens, one per word.

You’ll use the Hugging Face tokenizer to handle the text-to-token conversion, then wrap it with functions that work with MAX tensors. This separation keeps tokenization (a preprocessing step) separate from model inference (tensor operations).

Understanding tokenization

Tokenization converts text to a list of integers. The GPT-2 tokenizer uses a vocabulary of 50,257 tokens, where common words get single tokens and rare words split into subwords.

The Hugging Face tokenizer provides an encode method that takes text and returns a Python list of token IDs. For example:

token_ids = tokenizer.encode("Hello world")  # Returns [15496, 995]

You can specify max_length and truncation=True to limit sequence length. If the text exceeds max_length, the tokenizer cuts it off. This prevents memory issues with very long inputs.
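
A sketch of truncation in action (the long prompt here is hypothetical; tokenizer is assumed loaded as above):

long_text = "word " * 1000  # tokenizes to far more than 128 tokens
token_ids = tokenizer.encode(long_text, max_length=128, truncation=True)
assert len(token_ids) <= 128  # anything past max_length was cut off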

After encoding, you need to convert the Python list to a MAX tensor. Use Tensor.constant to create a tensor with the token IDs, specifying dtype=DType.int64 (GPT-2 expects 64-bit integers) and the target device.

The tensor needs shape [batch, seq_length] for model input. Wrap the token list in another list to add the batch dimension: [token_ids] becomes [[15496, 995]] with shape [1, 2].
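
Putting both requirements together (a minimal sketch; the Tensor.constant call is the same one the solution below uses):

from max.dtype import DType
from max.tensor import Tensor

# device: an existing max.driver Device, assumed created earlier
token_ids = [15496, 995]  # "Hello world"
input_tensor = Tensor.constant(
    [token_ids],           # outer list adds the batch dimension
    dtype=DType.int64,     # GPT-2 expects 64-bit integers
    device=device,
)
# input_tensor has shape [1, 2]: batch of 1, sequence of 2 tokens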

Understanding decoding

Decoding reverses tokenization: convert token IDs back to text. This requires moving tensors from GPU to CPU, converting to NumPy, then using the tokenizer’s decode method.

First, transfer the tensor to CPU with .to(CPU()). MAX tensors can live on GPU or CPU, but Python libraries like NumPy only work with CPU data.

Next, convert to NumPy using np.from_dlpack. DLPack is a standard that enables zero-copy tensor sharing between frameworks. The MAX tensor and NumPy array share the same underlying memory.

If the tensor is 2D (batch dimension present), flatten it to 1D with .flatten(). The tokenizer expects a flat list of token IDs, not a batched format.

Finally, convert to a Python list with .tolist() and decode with tokenizer.decode(token_ids, skip_special_tokens=True). The skip_special_tokens=True parameter removes padding and end-of-sequence markers from the output.
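
Here is the same pipeline on a concrete batched example, sketched with a plain NumPy array standing in for the CPU-transferred MAX tensor (the real version adds the .to(CPU()) and np.from_dlpack steps described above):

import numpy as np

batched = np.array([[15496, 995]])  # shape [1, 2], batch dimension present
flat = batched.flatten()            # shape [2]
token_ids_list = flat.tolist()      # [15496, 995]
text = tokenizer.decode(token_ids_list, skip_special_tokens=True)
print(text)  # "Hello world"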

MAX operations

You’ll use the following MAX operations to complete this task:

Tensor creation:

  • Tensor.constant(data, dtype=..., device=...): Creates a tensor from a Python list with the given dtype on the target device

Device transfer:

  • tensor.to(CPU()): Transfers a tensor to CPU memory so host libraries can read it

NumPy interop:

  • np.from_dlpack(tensor): Converts a MAX tensor to NumPy using the DLPack protocol

Implementing tokenization

You’ll create two functions: encode_text to convert strings to tensors, and decode_tokens to convert tensors back to strings.

First, import the required modules. You’ll need numpy as np for array operations, CPU from MAX’s driver for device specification, DType for specifying integer types, and Tensor for creating and manipulating tensors.

In encode_text, implement the encoding and conversion:

  1. Encode the text to token IDs using the tokenizer: token_ids = tokenizer.encode(text, max_length=max_length, truncation=True)
  2. Convert to a MAX tensor with batch dimension: Tensor.constant([token_ids], dtype=DType.int64, device=device)

Note the [token_ids] wrapping to create the batch dimension. This gives shape [1, seq_length] instead of just [seq_length].

In decode_tokens, implement the reverse process with explicit type conversions:

  1. Transfer to CPU and convert to NumPy with explicit type annotation: token_ids_np: np.ndarray = np.from_dlpack(token_ids.to(CPU()))
  2. Flatten if needed: if token_ids_np.ndim > 1: token_ids_np = token_ids_np.flatten()
  3. Convert to Python list with explicit type annotation: token_ids_list: list = token_ids_np.tolist()
  4. Decode to text: return tokenizer.decode(token_ids_list, skip_special_tokens=True)

Note the use of separate variable names (token_ids_np, token_ids_list) instead of reusing the same variable. This makes the type conversions explicit and improves code clarity: Tensor → np.ndarray → list → str. The flattening step handles both 1D and 2D tensors, making the function work with single sequences or batches.
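
Once both functions are implemented, an encode/decode round trip should recover the original text (a usage sketch; tokenizer and device are assumed initialized earlier):

tensor = encode_text("Hello world", tokenizer, device)  # shape [1, 2]
text = decode_tokens(tensor, tokenizer)
assert text == "Hello world"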

Implementation (step_09.py):

# ===----------------------------------------------------------------------=== #
#
# This file is Modular Inc proprietary.
#
# ===----------------------------------------------------------------------=== #
"""
Step 09: Encode and decode tokens

This module provides utility functions to tokenize input text
and decode token IDs back to text using a tokenizer.

Tasks:
1. Tokenize text and convert to tensor
2. Decode token IDs back to text

Run: pixi run s09
"""

# TODO: Import required modules
# Hint: You'll need numpy as np
# Hint: You'll need CPU from max.driver
# Hint: You'll need DType from max.dtype
# Hint: You'll need Tensor from max.tensor

from max.driver import Device
from max.tensor import Tensor
from transformers import GPT2Tokenizer


def encode_text(
    text: str, tokenizer: GPT2Tokenizer, device: Device, max_length: int = 128
) -> Tensor:
    """Tokenize text and convert to tensor.

    Args:
        text: Input text to tokenize
        tokenizer: HuggingFace tokenizer
        device: Device to place tensor on
        max_length: Maximum sequence length

    Returns:
        Tensor of token IDs with shape [1, seq_length]
    """
    # TODO: Encode text to token IDs
    # Hint: token_ids = tokenizer.encode(text, max_length=max_length, truncation=True)
    pass

    # TODO: Convert to MAX tensor
    # Hint: return Tensor.constant([token_ids], dtype=DType.int64, device=device)
    # Note: Wrap tokens in a list to create batch dimension
    return None


def decode_tokens(token_ids: Tensor, tokenizer: GPT2Tokenizer) -> str:
    """Decode token IDs back to text.

    Args:
        token_ids: Tensor of token IDs
        tokenizer: HuggingFace tokenizer

    Returns:
        Decoded text string
    """
    # TODO: Convert MAX tensor to NumPy array explicitly
    # Hint: Create a new variable with type annotation: token_ids_np: np.ndarray
    # Hint: token_ids_np = np.from_dlpack(token_ids.to(CPU()))
    # Note: This makes the type conversion from Tensor to np.ndarray explicit
    pass

    # TODO: Flatten if needed
    # Hint: if token_ids_np.ndim > 1: token_ids_np = token_ids_np.flatten()
    pass

    # TODO: Convert to Python list explicitly
    # Hint: Create a new variable: token_ids_list: list = token_ids_np.tolist()
    # Note: This makes the conversion from np.ndarray to list explicit
    pass

    # TODO: Decode to text
    # Hint: return tokenizer.decode(token_ids_list, skip_special_tokens=True)
    return None

Validation

Run pixi run s09 to verify your implementation correctly converts text to tensors and back.

Solution (step_09.py):
# ===----------------------------------------------------------------------=== #
#
# This file is Modular Inc proprietary.
#
# ===----------------------------------------------------------------------=== #
"""
Solution for Step 09: Encode and decode tokens

This module provides utility functions to tokenize input text
and decode token IDs back to text using a tokenizer.
"""

import numpy as np
from max.driver import CPU, Device
from max.dtype import DType
from max.tensor import Tensor
from transformers import GPT2Tokenizer


def encode_text(
    text: str, tokenizer: GPT2Tokenizer, device: Device, max_length: int = 128
) -> Tensor:
    """Tokenize text and convert to tensor."""
    # Encode text to a Python list of token IDs, truncating long inputs.
    token_ids = tokenizer.encode(text, max_length=max_length, truncation=True)
    # Wrap in a list to add the batch dimension: shape [1, seq_length].
    return Tensor.constant([token_ids], dtype=DType.int64, device=device)


def decode_tokens(token_ids: Tensor, tokenizer: GPT2Tokenizer) -> str:
    """Decode token IDs back to text."""
    # Move to CPU and view as a NumPy array via DLPack (zero-copy).
    token_ids_np: np.ndarray = np.from_dlpack(token_ids.to(CPU()))
    # Flatten a batched [1, seq_length] tensor to a flat sequence.
    if token_ids_np.ndim > 1:
        token_ids_np = token_ids_np.flatten()
    # Convert to a plain Python list for the tokenizer.
    token_ids_list: list = token_ids_np.tolist()
    return tokenizer.decode(token_ids_list, skip_special_tokens=True)

Next: In Step 10, you’ll implement the text generation loop that uses these functions to produce coherent text autoregressively.