Chapter · AI
Evaluation
How do you grade a model that can do almost anything? The benchmarks, the methodology, the metrics — and why every claim about model capability deserves scrutiny.
Topics
Topic 1
Evaluation Methodology
What does it mean to evaluate a model that can do almost anything?
Topic 2
Benchmarks & Benchmaxxing
The standard tests, what they measure, and how they get gamed.
Topic 3
LLM-as-Judge
Using a strong model to grade other models — and the biases this introduces.
Topic 4
Metrics
Accuracy, F1, BLEU, perplexity, pass@k — picking the right one for the task.
Topic 5
Golden Datasets
The hand-curated test sets that ground every other measurement.
Topic 6
Data Contamination
When the test set leaks into training, and how to detect it.