Chapter · AI
Inference
Turning a trained model into something a user can actually use. Decoding, sampling, quantization, caching, and the systems engineering that makes responses fast and cheap.
Topics
Topic 1
Decoding Strategies
Greedy, beam, sampling — how a probability distribution becomes an actual sequence of tokens.
Topic 2
Sampling Parameters
Temperature, top-k, top-p — what each knob actually does to outputs.
Topic 3
Quantization
Shrinking weights from 16 bits to 8, 4, or fewer, and the accuracy you give up in exchange.
Topic 4
Speculative Decoding
Using a small model to draft tokens that the big model then verifies in parallel.
Topic 5
Structured Outputs
Constraining generation to JSON, code, or any other grammar.
Topic 6
Latency, Throughput & TTFT
The metrics inference systems are actually graded on, including time to first token (TTFT), and the tradeoffs between them.
Topic 7
Inference Engines & Hardware
vLLM, SGLang, TensorRT-LLM — and the GPUs and accelerators underneath.