Scaling LLM Inference
A practitioner's account of taking a 70B-parameter model from a research notebook to 10,000 requests per second.
Large language models trained on academic budgets often die quietly when handed to a production team. The gap between "the notebook works" and "ten thousand users hit it on a Tuesday" is wider than most teams plan for. This paper walks through the architecture choices that closed that gap for one open-source 70B-parameter model, with measurements at each step.
Background
Inference for transformer models has two distinct costs:
- Prefill — running the full input prompt through the model once. Compute-bound; scales with sequence length squared due to attention.
- Decode — generating one token at a time, each conditioned on all prior tokens. Memory-bound; dominated by KV-cache reads.
Naive serving conflates the two. A request that takes 200ms to prefill might then spend 5 seconds in decode, and during that decode the GPU's compute units sit mostly idle waiting on memory reads. Modern inference servers (vLLM, TGI, TensorRT-LLM) treat prefill and decode as separable workloads and batch them differently.
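To make the memory-bound claim concrete, here is a rough back-of-the-envelope sketch. The model shape (80 layers, 8 KV heads, head dimension 128) and the 3.35 TB/s per-GPU bandwidth are assumptions chosen to resemble public 70B models and H100 SXM parts; they are not taken from the measurements in this paper. The function names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    # Assumed shape, similar to public 70B models with grouped-query attention.
    n_layers: int = 80
    n_kv_heads: int = 8
    head_dim: int = 128
    n_params: float = 70e9
    bytes_per_value: int = 2  # BF16


def kv_bytes_per_token(m: ModelShape) -> int:
    # One key vector and one value vector per layer per KV head.
    return 2 * m.n_layers * m.n_kv_heads * m.head_dim * m.bytes_per_value


def decode_step_floor_ms(m: ModelShape, context_tokens: int, batch_size: int,
                         mem_bw_bytes_per_s: float) -> float:
    # A decode step streams all weights once, plus the KV-cache of every
    # sequence in the batch. Dividing by bandwidth gives a lower bound only.
    weight_bytes = m.n_params * m.bytes_per_value
    kv_bytes = batch_size * context_tokens * kv_bytes_per_token(m)
    return 1000.0 * (weight_bytes + kv_bytes) / mem_bw_bytes_per_s


shape = ModelShape()
per_token = kv_bytes_per_token(shape)  # 327,680 bytes, i.e. 320 KiB per token
floor_ms = decode_step_floor_ms(
    shape, context_tokens=2048, batch_size=32,
    mem_bw_bytes_per_s=8 * 3.35e12,  # assumed aggregate bandwidth of 8 GPUs
)
```

Under these assumptions the weights dominate the bytes moved per step, which is why adding sequences to a decode batch is nearly free until the KV-cache term catches up.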
Prior work
The "Orca" paper (OSDI 2022) introduced iteration-level scheduling, now commonly called continuous batching: the batch is re-formed at every decoding iteration rather than once per request, so finished sequences leave immediately and new ones join. The authors report a 36.9× throughput improvement over NVIDIA FasterTransformer at the same latency on a GPT-3 175B model.
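The scheduling difference can be seen in a toy simulation. This is a minimal sketch, not Orca's scheduler: it counts decode steps only, ignores prefill cost, and assumes every sequence generates at least one token and that all requests are already queued.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    # Each batch runs until its longest sequence finishes.
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    # The batch is re-formed every step: finished sequences leave,
    # waiting sequences take the freed slots.
    waiting = deque(lengths)
    running = []  # tokens still to generate for each active sequence
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # Every active sequence emits one token; drop the ones that are done.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


lengths = [2, 8, 2, 8]  # output lengths in tokens (made-up workload)
print(static_batching_steps(lengths, batch_size=2))      # 16
print(continuous_batching_steps(lengths, batch_size=2))  # 12
```

The gain grows with the variance in output length, since static batching pays for the longest sequence in every batch.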
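PagedAttention (vLLM, 2023) added paged memory management for the KV-cache, treating GPU memory like virtual memory in an OS. Fragmentation drops sharply, and the vLLM paper reports 2–4× higher throughput at the same latency than earlier systems such as FasterTransformer and Orca.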
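The paging idea can be sketched with a small block allocator. This is an illustrative toy, not vLLM's implementation: the class name, method names, and the block size of 16 tokens are assumptions made for the example.

```python
class PagedKVAllocator:
    """Toy block table: maps each sequence to fixed-size physical blocks."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        n = self.seq_lens.get(seq_id, 0)
        if n % self.block_size == 0:
            # First token, or the last block is full: grab one more block.
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            block = self.free_blocks.pop()
            self.block_tables.setdefault(seq_id, []).append(block)
        self.seq_lens[seq_id] = n + 1

    def locate(self, seq_id, token_idx):
        # Translate a logical token position to (physical block, offset).
        block = self.block_tables[seq_id][token_idx // self.block_size]
        return block, token_idx % self.block_size

    def free(self, seq_id):
        # Return every block of a finished sequence to the pool.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

Because blocks are allocated on demand, a sequence wastes at most one partially filled block, instead of a contiguous reservation sized for the maximum output length.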
Speculative decoding (Leviathan et al., 2023) uses a small draft model to propose multiple tokens, which the large model verifies in parallel. For models with high agreement between draft and target, this is a 2–3× decode-rate improvement.
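The propose-then-verify loop can be sketched as follows. Note the simplification: this is the greedy variant, where a draft token is accepted only if it equals the target's own next token. It is not the sampling-based acceptance rule from the paper. The `draft_next` and `target_next` callables are hypothetical stand-ins for the two models, and the per-position target calls would be a single batched forward pass in a real server.

```python
def speculative_step(prefix, draft_next, target_next, gamma):
    """Return the tokens produced by one draft-and-verify round."""
    # 1. The draft model proposes gamma tokens autoregressively.
    proposal = []
    for _ in range(gamma):
        proposal.append(draft_next(prefix + proposal))

    # 2. The target model checks each proposed position in order.
    accepted = []
    for tok in proposal:
        expected = target_next(prefix + accepted)
        if tok != expected:
            # First mismatch: emit the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every draft token matched, so the target's next token comes for free.
    accepted.append(target_next(prefix + accepted))
    return accepted


# Toy models over integer tokens: the draft is wrong at every fourth position.
def target(p):
    return len(p) % 5


def draft(p):
    return len(p) % 5 if len(p) % 4 != 3 else -1


print(speculative_step([], draft, target, gamma=4))  # [0, 1, 2, 3]
```

The output always matches what the target model alone would have generated; the draft only changes how many target passes are needed to get there.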
Core argument
Production inference is not a single optimization problem. It's three problems stacked:
- Throughput — tokens per second per dollar of GPU
- Latency — time-to-first-token (TTFT) and inter-token-latency (ITL)
- Tail behavior — what the p99.9 user experiences
Optimizations that improve one often regress another. Continuous batching lifts throughput by 10× but raises TTFT for any request that joins a busy batch. Speculative decoding lifts decode rate but spends extra compute on drafting and verification, which competes with prefill for the same GPUs. The job of a serving system is to expose these trade-offs as tunable, not to claim a single winning configuration.
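The two latency metrics are easy to compute from per-token timestamps. A minimal sketch, assuming each request logs its arrival time and the emission time of every output token in the same clock and unit (the function name is illustrative):

```python
def ttft_and_itl(request_start, token_times):
    """Return (time-to-first-token, list of inter-token latencies)."""
    if not token_times:
        raise ValueError("request produced no tokens")
    # TTFT covers queueing plus prefill: arrival until the first token.
    ttft = token_times[0] - request_start
    # ITL is the gap between consecutive output tokens during decode.
    itl = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return ttft, itl


ttft, itl = ttft_and_itl(0.0, [0.25, 0.5, 0.75])
print(ttft, itl)  # 0.25 [0.25, 0.25]
```

Keeping the two apart matters because they respond to different knobs: TTFT to admission and prefill scheduling, ITL to decode batch size.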
Results
Measurements from a 70B-parameter dense model, served on 8× H100 GPUs with tensor-parallel splitting:
[Table: results by configuration; contents not recoverable from the source.]
TTFT held within 250ms (p50) and 800ms (p99) across all configurations. ITL stayed below 40ms (p99) once speculative decoding was tuned.
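A check like the following could gate a configuration against those bounds. It is a sketch under assumptions: the nearest-rank percentile definition and the function names are choices made here, while the thresholds (250ms, 800ms, 40ms) are the ones quoted above.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def meets_latency_bounds(ttft_ms, itl_ms):
    # Thresholds taken from the reported results; inputs are in milliseconds.
    return (
        percentile(ttft_ms, 50) <= 250
        and percentile(ttft_ms, 99) <= 800
        and percentile(itl_ms, 99) < 40
    )
```

With few samples the p99 is simply the maximum, so a check like this is only meaningful over a run with thousands of requests.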
Discussion
Two limitations worth naming:
Workload mix matters more than headline numbers. Speculative decoding helps short generations and hurts long ones — for a 4,096-token completion, the draft model's mispredictions compound and you end up doing extra verification work. We saw a 1.4× regression when the workload shifted from chat (50-token responses) to code generation (1,000+ tokens).
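One way to reason about when speculation stops paying off is the expected speedup expression from Leviathan et al.: with acceptance rate alpha, gamma draft tokens per round, and a draft-to-target cost ratio c, the expected walltime improvement is (1 - alpha^(gamma+1)) / ((1 - alpha)(gamma·c + 1)). The sketch below assumes that model (independent, identically distributed acceptances); the sample values of alpha and c are illustrative, not measured.

```python
def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Expected walltime speedup of speculative decoding over plain decoding."""
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    # Expected tokens produced per verification round.
    tokens_per_round = (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)
    # Cost of a round in units of one target step: gamma draft steps plus one.
    round_cost = gamma * cost_ratio + 1.0
    return tokens_per_round / round_cost


high_agreement = expected_speedup(alpha=0.8, gamma=4, cost_ratio=0.15)
low_agreement = expected_speedup(alpha=0.3, gamma=4, cost_ratio=0.15)
# A value below 1.0 means speculation is slower than decoding without a draft.
```

If acceptance falls on a new workload, the same gamma can turn a gain into a loss, so the draft length is worth tuning per workload rather than globally.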
FP8 quantization is not free. On reasoning benchmarks (MMLU, MATH), FP8 lost 0.8–1.4 points versus BF16. For factual recall benchmarks, the loss was within noise. If your users are doing chain-of-thought work, measure before shipping FP8.
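The source of the loss is visible in the number format itself. The sketch below rounds a value to the nearest point on an FP8 grid, assuming the E4M3 variant (3 mantissa bits, maximum 448, smallest subnormal 2^-9). The paper does not say which FP8 format was used, so this is an assumption, and the function ignores NaN handling and per-tensor scaling.

```python
import math


def quantize_e4m3(x: float) -> float:
    """Round x to the nearest representable E4M3 value, saturating at 448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), 448.0)          # saturate instead of overflowing
    _, e = math.frexp(a)            # a = m * 2**e with 0.5 <= m < 1
    e = max(e, -5)                  # below 2**-6 the grid is subnormal
    step = 2.0 ** (e - 4)           # 4 significant bits: 1 implicit + 3 stored
    return sign * round(a / step) * step


print(quantize_e4m3(0.3))     # 0.3125
print(quantize_e4m3(1000.0))  # 448.0
```

With only three mantissa bits the worst-case relative rounding error for normal values is about 6%, which real deployments reduce with scaling factors but cannot remove.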
Don't blindly stack optimizations
Each technique in the table above was deployed individually, measured, and only kept if the latency tradeoff was acceptable. Stacking them in one shot makes regressions impossible to attribute.
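The one-at-a-time procedure can be sketched as a loop. Everything here is hypothetical scaffolding: `measure` stands in for whatever benchmark harness a team already has, and the metric keys and the single TTFT budget are illustrative choices, not the acceptance criteria used for the results above.

```python
def staged_rollout(techniques, measure, ttft_p99_budget_ms):
    """Enable techniques one at a time, keeping each only if it helps."""
    enabled = []
    log = []
    baseline = measure(enabled)
    for name in techniques:
        trial = measure(enabled + [name])
        keep = (
            trial["throughput"] > baseline["throughput"]
            and trial["ttft_p99_ms"] <= ttft_p99_budget_ms
        )
        # Record every trial so a later regression can be attributed.
        log.append((name, keep, trial))
        if keep:
            enabled.append(name)
            baseline = trial
    return enabled, log
```

Because each trial differs from the current baseline by exactly one technique, any change in the metrics has a single candidate cause.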
Conclusion
Going from 1× to 30× throughput is not a single technique — it's the disciplined application of four, each carefully measured against latency and quality. The serving system that gets this right is the one that exposes these as runtime knobs rather than build-time decisions.
For teams just starting out: deploy continuous batching first. It's the largest single win, the most stable, and the easiest to reason about. Layer the rest on once you have measurements to defend each one.
References
- Yu et al., "Orca: A Distributed Serving System for Transformer-Based Generative Models" (OSDI 2022)
- Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention" (SOSP 2023)
- Leviathan, Kalman, Matias, "Fast Inference from Transformers via Speculative Decoding" (ICML 2023)
- Dettmers et al., "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale" (NeurIPS 2022)