Encode and decode tokens

Convert between text and token IDs using the HuggingFace GPT-2 tokenizer.

The model operates on integer token IDs, not raw text. encode_text and decode_tokens bridge the gap, wrapping the HuggingFace tokenizer in a minimal interface.

How GPT-2 tokenizes text

GPT-2 uses Byte Pair Encoding (BPE): it breaks text into subword units drawn from a vocabulary of 50,257 tokens. Common words get a single token; rarer words are split into pieces. For example, “Hello world” becomes [15496, 995].

The tokenizer handles all the vocabulary details. The encode and decode functions just call it and pass the results along.

The code

def encode_text(
    text: str, tokenizer: GPT2Tokenizer, max_length: int = 128
) -> list[int]:
    """Tokenize text and return token IDs as a plain Python list."""
    return tokenizer.encode(text, max_length=max_length, truncation=True)


def decode_tokens(token_ids: list[int], tokenizer: GPT2Tokenizer) -> str:
    """Decode a list of token IDs back to text."""
    return tokenizer.decode(token_ids, skip_special_tokens=True)


encode_text returns a plain Python list[int]—the token IDs are kept as Python data at this stage and only converted to a MAX tensor when needed in the generation loop (Section 10).

decode_tokens takes a list[int] and returns a string. skip_special_tokens=True removes the EOS and padding markers that GPT-2 uses internally from the decoded text.

The functions accept the tokenizer as a parameter rather than capturing it as a global, making them straightforward to test and reuse.

Next: Section 10 builds the generation loop that uses these functions to produce text autoregressively.