Chapter · AI

Inference

Turning a trained model into something a user can actually use. Decoding, sampling, quantization, caching, and the systems engineering that makes responses fast and cheap.

Topics

Topic 1

Decoding Strategies

Greedy, beam, sampling — how a probability distribution becomes an actual sequence of tokens.

Planned

Topic 2

Sampling Parameters

Temperature, top-k, top-p — what each knob actually does to outputs.

Planned

Topic 3

Quantization

Shrinking weights from 16 bits to 8, 4, or fewer, and the accuracy you give up in return.

Planned

Topic 4

Speculative Decoding

Using a small model to draft tokens that the big model then verifies in parallel.

Planned

Topic 5

Structured Outputs

Constraining generation to JSON, code, or any other grammar.

Planned

Topic 6

Latency, Throughput & TTFT

The metrics inference systems are actually graded on, including time to first token (TTFT), and the tradeoffs between them.

Planned

Topic 7

Inference Engines & Hardware

vLLM, SGLang, TensorRT-LLM — and the GPUs and accelerators underneath.

Planned