LLM Inference Optimization — Speculative Decoding, KV Cache, and Quantization in Practice
Concrete techniques to reduce cost and latency in large language model inference, with documented trade-offs and recommendations for when to apply each one.
Last updated: 3/26/2026
Why this matters now
Most engineering teams have moved past the "run an LLM locally" phase and into the "serve an LLM in production at a controlled cost" phase, and the conclusion is consistent: inference is where the money actually goes. At non-trivial volume, the cost of serving a multi-billion-parameter model easily surpasses its training cost within a few months.
The good news: in 2026, the inference optimization ecosystem has matured enough that most techniques are applicable without deep application rewrites.
Quantization — the technique with the best ROI
Quantization reduces the numerical precision of model weights (from FP16 to INT8, INT4, or even INT3), cutting memory usage and accelerating computation with minimal quality loss.
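The memory savings follow directly from the bit width. A quick back-of-the-envelope sketch (weights only — real deployments also hold KV cache, activations, and quantization metadata, so treat these as lower bounds):

```python
# Weight memory at a given precision: params x bits / 8, in GB.
def weight_gb(num_params, bits):
    return num_params * bits / 8 / 1e9

params = 70e9  # a 70B-parameter model
print(weight_gb(params, 16))  # 140.0 GB in FP16 — needs multiple GPUs
print(weight_gb(params, 4))   # 35.0 GB in INT4 — a very different hardware bill
```

The same arithmetic explains why 4-bit is the sweet spot: halving again to 2-3 bits saves comparatively little absolute memory while quality drops sharply.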
Key approaches
- GPTQ: one-shot post-training quantization, widely supported. Good for models you won't retrain. Noticeable quality degradation below 4 bits on complex tasks.
- AWQ: protects important model channels (salient activations), maintaining superior quality to GPTQ at 4 bits with similar deployment cost.
- bitsandbytes (NF4): direct Hugging Face integration, dynamic loading, ideal for prototyping. Slightly slower than GPTQ/AWQ in raw throughput.
When to use
```python
# Example with bitsandbytes — loading in 4-bit
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-70B",
    quantization_config=quantization_config,
    device_map="auto",
)
```

For most production cases, AWQ at 4 bits offers the best balance between quality and performance. Reserve GPTQ for when you need maximum speed and can tolerate slight degradation.
KV Cache — the bottleneck everyone ignores
The KV cache stores attention key-value vectors for previous tokens, avoiding recomputation. The problem: in large models with long contexts, it consumes more memory than the model weights themselves.
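The claim is easy to sanity-check with the standard KV cache sizing formula: 2 (keys plus values) times layers, KV heads, head dimension, and bytes per element, per token. The architecture numbers below are illustrative values for a Llama-3-70B-class model with grouped-query attention — check your model's config for the real ones:

```python
# KV cache size: 2 (K and V) x layers x kv_heads x head_dim x bytes, per token.
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, dtype_bytes=2):
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * dtype_bytes

# Illustrative 70B-class config: 80 layers, 8 KV heads (GQA), head_dim 128, FP16 cache.
per_token_kb = kv_cache_bytes(80, 8, 128, seq_len=1, batch_size=1) / 1024
batch_gb = kv_cache_bytes(80, 8, 128, seq_len=8192, batch_size=32) / 1e9
print(per_token_kb)     # 320.0 KB per token
print(round(batch_gb))  # ~86 GB for batch 32 at 8K context
```

At batch 32 and 8K context, the cache alone far exceeds the ~35 GB of 4-bit-quantized weights — which is exactly why paging, windowing, and cache compression pay off.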
Optimization strategies
- PagedAttention (vLLM): manages KV cache as virtual memory pages, eliminating fragmentation and allowing effectively larger batch sizes. Typical 2-4x memory reduction vs contiguous allocation.
- Windowed attention: limits attention context to a sliding window. Appropriate when relevant information is concentrated at the end of the prompt. Quality loss on tasks depending on long-range recall.
- KV cache compression: techniques like H2O (Heavy Hitter Oracle) discard less relevant cache entries. Useful for very long contexts (>16K tokens).
Practical configuration with vLLM
```python
# Serving an AWQ-quantized model with vLLM
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3-70B-AWQ",
    tensor_parallel_size=4,
    gpu_memory_utilization=0.92,
    max_model_len=8192,
    kv_cache_dtype="fp8_e5m2",  # additional KV cache compression
)
```

Continuous Batching — serving multiple users efficiently
Static batching (waiting for N requests to process together) is simple but introduces latency. Continuous batching (dynamic batching) processes requests as they arrive, adding and removing from the batch in real time.
Frameworks like vLLM and TGI implement this natively. The gains are significant: without continuous batching, throughput drops 3-5x when traffic is variable. In 2026, there's no reason to use static batching in production.
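The throughput difference is easy to see in a toy discrete-time simulation. This is purely illustrative — one decode step per tick, and `simulate` is a hypothetical simplification, not any framework's actual scheduler:

```python
# Toy simulation: each request needs N decode steps; the GPU advances up to
# `batch_size` active sequences by one step per tick.
import random

def simulate(lengths, batch_size, continuous):
    queue = list(lengths)
    active, ticks = [], 0
    while queue or active:
        if continuous:
            # Continuous batching: refill any free slot immediately.
            while queue and len(active) < batch_size:
                active.append(queue.pop(0))
        elif not active:
            # Static batching: only refill once the whole batch has drained.
            active = [queue.pop(0) for _ in range(min(batch_size, len(queue)))]
        active = [n - 1 for n in active]       # one decode step for everyone
        active = [n for n in active if n > 0]  # finished sequences leave
        ticks += 1
    return ticks

random.seed(0)
lengths = [random.randint(8, 256) for _ in range(64)]  # varied output lengths
static_ticks = simulate(lengths, batch_size=8, continuous=False)
cont_ticks = simulate(lengths, batch_size=8, continuous=True)
print(cont_ticks < static_ticks)  # True: continuous batching finishes sooner
```

The gap comes from stragglers: in static batching, every slot idles until the longest sequence in the batch finishes, and variable-length LLM outputs make that waste the common case rather than the exception.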
Speculative Decoding — faster inference without quality reduction
Speculative decoding uses a smaller model (draft model) to generate multiple candidate tokens, which are validated in parallel by the large model (target model). Correct tokens are kept; incorrect ones are discarded and regenerated.
When it's worth it:
- Latency-sensitive serving at low batch sizes, where decoding is memory-bandwidth bound and the GPU has spare compute to run a draft model on the same hardware
- The draft model comes from the same family as the target (shared tokenizer, similar output distributions), ideally around 10-20% of the target model's size
- Acceptance rates are high on your workload: each target forward pass then validates several tokens at once, so end-to-end latency per request drops significantly
When it's not:
- Throughput-oriented serving at large batch sizes, where the GPU is already compute-bound and the draft model competes for the same FLOPs
- The draft model's acceptance rate is low on your prompts (e.g., out-of-domain traffic), so its inference cost exceeds the latency gain
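The draft-then-verify loop can be sketched with toy stand-ins. This is the greedy case only; `draft` and `target` are hypothetical callables that return the next token id for a prefix, and a real system verifies all k draft tokens in a single batched forward pass of the target rather than one call per token:

```python
# One step of greedy speculative decoding (illustrative sketch).
def speculative_step(prefix, draft, target, k=4):
    # 1) The cheap draft model proposes k tokens autoregressively.
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)
    # 2) The target verifies: keep the longest prefix where it agrees.
    accepted, ctx = [], list(prefix)
    for t in proposed:
        if target(ctx) != t:
            break
        accepted.append(t)
        ctx.append(t)
    # 3) The target always supplies one token itself (its own prediction at
    # the first disagreement), so every step makes progress even when the
    # draft is always wrong — output is identical to plain greedy decoding.
    accepted.append(target(ctx))
    return accepted

oracle = lambda ctx: len(ctx)          # toy "model": next token = prefix length
print(speculative_step([0], oracle, oracle, k=4))  # [1, 2, 3, 4, 5]
```

With a perfect draft, each step emits k+1 tokens for one (batched) target pass; with a useless draft, it degrades to one token per pass — which is why the acceptance rate decides whether the technique pays for itself.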
Decision framework
```
What's your main problem?
├── GPU cost too high
│   └── Start with AWQ 4-bit quantization + continuous batching (vLLM)
├── High per-request latency
│   └── KV cache optimization (PagedAttention) + speculative decoding
├── Can't serve enough users
│   └── Continuous batching + quantization to free up memory
└── Quality degraded with quantization
    └── AWQ instead of GPTQ, or stay at 8 bits for sensitive tasks
```

Most teams solve 60-80% of the problem with quantization + continuous batching. The remaining techniques (speculative decoding, KV cache compression) come in when there's a specific need or when scale demands it.
Next steps
- Measure before optimizing: baseline of P50/P99 latency, throughput (tokens/s), and cost per 1M tokens with current configuration
- Implement quantization first: the best cost-benefit change with lowest risk
- Add continuous batching: if using vLLM or TGI, it's already available — just configure it
- Evaluate speculative decoding only if you need latency reduction with preserved quality
Need to optimize LLM inference in production? Talk to Imperialis about inference optimization and reduce cost and latency based on data.