1. Introduction
2. Setup
3. Model configuration
4. Causal masking
5. Layer normalization
6. Feed-forward network
7. Token embeddings
8. Position embeddings
9. Multi-head attention
10. Residual connections and layer normalization
11. Transformer block
12. Stacking transformer blocks
13. Language model head
14. Text generation