I recently watched Andrej Karpathy’s Deep Dive into LLMs like ChatGPT — an excellent hands-on walkthrough with concrete examples. Here’s my visual summary for future reference.
The tokenization process can be explored visually with the Tiktokenizer app, for example:
"Hello world" → [15496, 995] (2 tokens)
"hello world" → [31373, 995] (2 tokens - different IDs due to case!)
"HelloWorld" → [15496, 10603] (2 tokens - camelCase split)
How LLMs generate text, token by token (see the sampling-loop sketch after this list):
Input: "The cat sat on the"
1. Model predicts probability distribution for next token
- "mat": 15%, "floor": 12%, "roof": 8%, "dog": 0.1%...
2. Sample from distribution (stochastic!)
- Selected: "mat"
3. Append to context: "The cat sat on the mat"
4. REPEAT until <end> token or max length
- "The cat sat on the mat and purred softly..."
- Each run produces slightly different output (stochastic)
- Context window limits how much the model can "remember"
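A minimal sketch of that loop, assuming a Hugging Face causal LM (GPT-2 here only as a small stand-in model):

```python
# Sketch of the autoregressive sampling loop: predict a distribution over the
# next token, sample from it, append to the context, repeat until <end> or a cap.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("The cat sat on the", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(20):                                    # max length cap
        logits = model(ids).logits[0, -1]                  # scores for every possible next token
        probs = torch.softmax(logits, dim=-1)              # probability distribution
        next_id = torch.multinomial(probs, num_samples=1)  # stochastic sampling
        ids = torch.cat([ids, next_id[None]], dim=-1)      # append to the context
        if next_id.item() == tokenizer.eos_token_id:       # stop at the <end> token
            break
print(tokenizer.decode(ids[0]))  # a slightly different continuation on every run
```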
Conversations are encoded as one flat token stream, with special tokens marking each turn:
<|im_start|>system<|im_sep|>You are a helpful assistant<|im_end|>
<|im_start|>user<|im_sep|>What is 4 + 4?<|im_end|>
<|im_start|>assistant<|im_sep|>4 + 4 = 8<|im_end|>
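A minimal sketch of how a chat gets rendered into that stream; the <|im_start|>/<|im_sep|>/<|im_end|> markers mirror the example above, and the exact special tokens vary by model family:

```python
# Sketch: flatten a chat into a single string before tokenization.
# The special-token names mirror the example above; real templates vary by model.
def render_chat(messages):
    return "".join(
        f"<|im_start|>{m['role']}<|im_sep|>{m['content']}<|im_end|>" for m in messages
    )

chat = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "What is 4 + 4?"},
]
# The model then continues from the open assistant turn, token by token.
prompt = render_chat(chat) + "<|im_start|>assistant<|im_sep|>"
print(prompt)
```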
The model learns to call tools when it is uncertain (see the loop sketch after this list):
User: "Who is Orson Kovacs?"
1. Model recognizes it doesn't know
- Outputs: <SEARCH_START>Who is Orson Kovacs?<SEARCH_END>
2. System executes search, injects results
- [Wikipedia article, news articles, etc.]
3. Model generates answer using search results
- "Orson Kovacs is a fictional character..."
4. If still unsure → REPEAT search with refined query
- Puts facts into working memory (the context window) instead of relying on vague recall from the parameters
- Dramatically reduces hallucinations for factual queries
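A minimal sketch of that loop from the system's side; `generate` and `web_search` are hypothetical placeholders for the model call and the search tool, and the special tokens mirror the example above:

```python
# Sketch of the tool-use loop: let the model emit a search query between special
# tokens, run the search, inject the results into the context, and continue.
import re

def generate(context: str) -> str:
    """Placeholder for the LLM call."""
    raise NotImplementedError

def web_search(query: str) -> str:
    """Placeholder for the system-side search tool."""
    raise NotImplementedError

def answer_with_search(question: str, max_rounds: int = 3) -> str:
    context = f"User: {question}\nAssistant:"
    for _ in range(max_rounds):
        output = generate(context)                                # model continues the context
        m = re.search(r"<SEARCH_START>(.*?)<SEARCH_END>", output)
        if m is None:
            return output                                         # answered without searching
        results = web_search(m.group(1))                          # system runs the query
        context += output + f"\n[SEARCH RESULTS]\n{results}\n"    # inject results into context
    return generate(context)                                      # final answer with all results
```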
Teaching models to recognize their knowledge limits (see the pipeline sketch after this list):
1. EXTRACT: Take a snippet from training data
- "The Eiffel Tower weighs approximately 10,100 tons"
2. GENERATE: Create a factual question about it
- "What is the mass of the Eiffel Tower?"
3. ANSWER: Have the model answer the question
- Model says: "The Eiffel Tower weighs 7,300 tons"
4. SCORE: Compare answer against original source
- Wrong! Model hallucinated.
5. TRAIN: Teach model to refuse or use tools when unsure
- "I don't have reliable information about..."
6. REPEAT → Model learns its knowledge boundaries
- Model learns WHEN to say "I don't know"
- Model learns WHEN to use search instead of guessing
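A minimal sketch of that pipeline; `ask`, `answer`, and `is_correct` are hypothetical caller-supplied callables (an LLM that writes questions, the model being probed, and a grader), not a real API:

```python
# Sketch of the knowledge-probing pipeline: probe the model with questions built
# from training snippets, and turn its wrong answers into refusal training examples.
def build_refusal_examples(snippets, ask, answer, is_correct):
    examples = []
    for snippet in snippets:                # 1. EXTRACT a snippet from training data
        question = ask(snippet)             # 2. GENERATE a factual question about it
        attempt = answer(question)          # 3. ANSWER with the model being probed
        if is_correct(attempt, snippet):    # 4. SCORE against the original source
            continue                        #    it knows this fact, nothing to fix
        examples.append({                   # 5. TRAIN toward refusal on unknown facts
            "prompt": question,
            "target": "I don't have reliable information about that.",
        })
    return examples
```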
Problem: Emily buys 3 apples and 2 oranges. Each orange costs $2.
Total cost is $13. What is the cost of each apple?
THE RL LOOP (repeated millions of times - see the sketch after this list):
1. GENERATE: Model produces many solutions
- Solution A: "The answer is $3."
- Solution B: "Oranges: 2×$2=$4, Apples: $13-$4=$9..."
- Solution C: "Each apple costs $5." (wrong)
2. SCORE: Check which solutions are correct
- A: correct but no reasoning
- B: correct WITH step-by-step reasoning
- C: wrong
3. TRAIN: Update model weights on winning solutions
- Model learns to prefer B's reasoning style
4. REPEAT → Model gets better at reasoning over time
- No human labels needed - correctness is automatically verified
- This is how models develop "aha moments" (see the DeepSeek-R1 paper)
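A minimal sketch of that loop; `sample_solutions`, `extract_answer`, and `reinforce_on` are hypothetical placeholders for sampling from the model, parsing its final answer, and the weight update:

```python
# Sketch of RL on verifiable problems: generate many attempts, keep the ones whose
# final answer matches the known result, and reinforce the model on those.
def rl_on_verifiable_problems(problems, sample_solutions, extract_answer, reinforce_on,
                              rounds=1000, samples_per_problem=16):
    for _ in range(rounds):                                            # 4. REPEAT many times
        for prompt, true_answer in problems:
            attempts = sample_solutions(prompt, samples_per_problem)   # 1. GENERATE
            winners = [a for a in attempts
                       if extract_answer(a) == true_answer]            # 2. SCORE automatically
            if winners:
                reinforce_on(prompt, winners)                          # 3. TRAIN on the winners
```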
For unverifiable domains (jokes, writing, style), reinforcement learning from human feedback (RLHF) is used instead:
PHASE 1: Train the Reward Model
─────────────────────────────────────────────────────────
1. Task: "Write a joke about pelicans"
2. Model generates 5 different jokes
3. Humans rank them: B > D > A > E > C
4. Train reward model on these preferences (see the loss sketch below)
5. REPEAT until reward model is accurate
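A minimal sketch of the training objective in step 4, assuming a standard pairwise (Bradley-Terry style) preference loss; the talk doesn't spell out the exact loss, so treat the details as an assumption. `reward_model` is a placeholder that maps a (prompt, response) pair to a scalar score:

```python
# Sketch: pairwise preference loss for the reward model. For each human-ranked pair,
# push the score of the preferred response above the score of the rejected one.
import torch.nn.functional as F

def preference_loss(reward_model, prompt, preferred, rejected):
    r_good = reward_model(prompt, preferred)   # scalar score for the preferred response
    r_bad = reward_model(prompt, rejected)     # scalar score for the rejected response
    # P(preferred beats rejected) = sigmoid(r_good - r_bad); maximize its log-likelihood
    return -F.logsigmoid(r_good - r_bad).mean()
```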
PHASE 2: Use Reward Model at Scale
─────────────────────────────────────────────────────────
1. GENERATE: Model produces many responses
2. SCORE: Reward model rates each one
3. TRAIN: Update on highest-scored responses
4. REPEAT (but CAPPED to ~100s of iterations; see the sketch below)
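A minimal sketch of Phase 2 with the iteration cap built in; `sample_responses` and `reinforce_on` are hypothetical placeholders for sampling from the model and the weight update:

```python
# Sketch of Phase 2: the reward model replaces human raters, but the number of
# update rounds is capped to limit reward hacking (see below).
def rlhf_phase2(prompts, sample_responses, reward_model, reinforce_on, max_iters=300):
    for _ in range(max_iters):                                          # CAPPED number of rounds
        for prompt in prompts:
            candidates = sample_responses(prompt, n=8)                  # 1. GENERATE many responses
            scored = [(reward_model(prompt, c), c) for c in candidates] # 2. SCORE each one
            best = max(scored, key=lambda s: s[0])[1]                   # keep the top-scored response
            reinforce_on(prompt, best)                                  # 3. TRAIN toward it
```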
WHY CAPPED?
After too many iterations, the model finds adversarial exploits
- It might output "the the the the" to game the reward model
- This is called "reward hacking"
WHY USE RLHF AT ALL?
- Humans can't label millions of examples, but a reward model can
- It exploits the "discriminator-generator gap": judging a response is easier than creating one