LLM Inference Optimization — Speculative Decoding, KV Cache, and Quantization in Practice
Concrete techniques to reduce cost and latency in large language model inference, with documented trade-offs and recommendations for when to apply each one.
Last updated: 3/26/2026
Why this matters now
Most engineering teams have moved past the "run an LLM locally" phase and into the "serve an LLM in production at a controlled cost" phase, and the conclusion is consistent: inference is where the money actually goes. At non-trivial volume, the cost of serving a multi-billion-parameter model easily surpasses its training cost within a few months.
The good news: in 2026, the inference optimization ecosystem has matured enough that most techniques are applicable without deep application rewrites.
Quantization — the technique with the best ROI
Quantization reduces the numerical precision of model weights (from FP16 to INT8, INT4, or even INT3), cutting memory usage and accelerating computation with minimal quality loss.
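The memory savings follow directly from the bit width. A quick back-of-the-envelope sketch (weights only — real deployments also hold KV cache, activations, and quantization metadata, so treat these as lower bounds):

```python
# Weight memory at a given precision: params x bits / 8, in GB.
def weight_gb(num_params, bits):
    return num_params * bits / 8 / 1e9

params = 70e9  # a 70B-parameter model
print(weight_gb(params, 16))  # 140.0 GB in FP16 — needs multiple GPUs
print(weight_gb(params, 4))   # 35.0 GB in INT4 — a very different hardware bill
```

The same arithmetic explains why 4-bit is the sweet spot: halving again to 2-3 bits saves comparatively little absolute memory while quality drops sharply.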
Key approaches
- GPTQ: one-shot post-training quantization, widely supported. Good for models you won't retrain. Noticeable quality degradation below 4 bits on complex tasks.
- AWQ: protects important model channels (salient activations), maintaining superior quality to GPTQ at 4 bits with similar deployment cost.
- bitsandbytes (NF4): direct Hugging Face integration, dynamic loading, ideal for prototyping. Slightly slower than GPTQ/AWQ in raw throughput.
When to use
```python
# Example with bitsandbytes — loading in 4-bit
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-70B",
    quantization_config=quantization_config,
    device_map="auto",
)
```

For most production cases, AWQ at 4 bits offers the best balance between quality and performance. Reserve GPTQ for when you need maximum speed and can tolerate slight degradation.
KV Cache — the bottleneck everyone ignores
The KV cache stores attention key-value vectors for previous tokens, avoiding recomputation. The problem: in large models with long contexts, it consumes more memory than the model weights themselves.
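The claim is easy to sanity-check with the standard KV cache sizing formula: 2 (keys plus values) times layers, KV heads, head dimension, and bytes per element, per token. The architecture numbers below are illustrative values for a Llama-3-70B-class model with grouped-query attention — check your model's config for the real ones:

```python
# KV cache size: 2 (K and V) x layers x kv_heads x head_dim x bytes, per token.
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, dtype_bytes=2):
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * dtype_bytes

# Illustrative 70B-class config: 80 layers, 8 KV heads (GQA), head_dim 128, FP16 cache.
per_token_kb = kv_cache_bytes(80, 8, 128, seq_len=1, batch_size=1) / 1024
batch_gb = kv_cache_bytes(80, 8, 128, seq_len=8192, batch_size=32) / 1e9
print(per_token_kb)     # 320.0 KB per token
print(round(batch_gb))  # ~86 GB for batch 32 at 8K context
```

At batch 32 and 8K context, the cache alone far exceeds the ~35 GB of 4-bit-quantized weights — which is exactly why paging, windowing, and cache compression pay off.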
Optimization strategies
- PagedAttention (vLLM): manages KV cache as virtual memory pages, eliminating fragmentation and allowing effectively larger batch sizes. Typical 2-4x memory reduction vs contiguous allocation.
- Windowed attention: limits attention context to a sliding window. Appropriate when relevant information is concentrated at the end of the prompt. Quality loss on tasks depending on long-range recall.
- KV cache compression: techniques like H2O (Heavy Hitter Oracle) discard less relevant cache entries. Useful for very long contexts (>16K tokens).
Practical configuration with vLLM
```python
# Serving an AWQ-quantized model with vLLM
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3-70B-AWQ",
    tensor_parallel_size=4,
    gpu_memory_utilization=0.92,
    max_model_len=8192,
    kv_cache_dtype="fp8_e5m2",  # additional KV cache compression
)
```

Continuous Batching — serving multiple users efficiently
Static batching (waiting for N requests to process together) is simple but introduces latency. Continuous batching (dynamic batching) processes requests as they arrive, adding and removing from the batch in real time.
Frameworks like vLLM and TGI implement this natively. The gains are significant: without continuous batching, throughput drops 3-5x when traffic is variable. In 2026, there's no reason to use static batching in production.
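The throughput difference is easy to see in a toy discrete-time simulation. This is purely illustrative — one decode step per tick, and `simulate` is a hypothetical simplification, not any framework's actual scheduler:

```python
# Toy simulation: each request needs N decode steps; the GPU advances up to
# `batch_size` active sequences by one step per tick.
import random

def simulate(lengths, batch_size, continuous):
    queue = list(lengths)
    active, ticks = [], 0
    while queue or active:
        if continuous:
            # Continuous batching: refill any free slot immediately.
            while queue and len(active) < batch_size:
                active.append(queue.pop(0))
        elif not active:
            # Static batching: only refill once the whole batch has drained.
            active = [queue.pop(0) for _ in range(min(batch_size, len(queue)))]
        active = [n - 1 for n in active]       # one decode step for everyone
        active = [n for n in active if n > 0]  # finished sequences leave
        ticks += 1
    return ticks

random.seed(0)
lengths = [random.randint(8, 256) for _ in range(64)]  # varied output lengths
static_ticks = simulate(lengths, batch_size=8, continuous=False)
cont_ticks = simulate(lengths, batch_size=8, continuous=True)
print(cont_ticks < static_ticks)  # True: continuous batching finishes sooner
```

The gap comes from stragglers: in static batching, every slot idles until the longest sequence in the batch finishes, and variable-length LLM outputs make that waste the common case rather than the exception.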
Speculative Decoding — faster inference without quality reduction
Speculative decoding uses a smaller model (draft model) to generate multiple candidate tokens, which are validated in parallel by the large model (target model). Correct tokens are kept; incorrect ones are discarded and regenerated.
When it's worth it:
- Latency-sensitive serving at low batch sizes, where decoding is memory-bandwidth bound and the GPU has spare compute to run a draft model on the same hardware
- The draft model comes from the same family as the target (shared tokenizer, similar output distributions), ideally around 10-20% of the target model's size
- Acceptance rates are high on your workload: each target forward pass then validates several tokens at once, so end-to-end latency per request drops significantly
When it's not:
- Throughput-oriented serving at large batch sizes, where the GPU is already compute-bound and the draft model competes for the same FLOPs
- The draft model's acceptance rate is low on your prompts (e.g., out-of-domain traffic), so its inference cost exceeds the latency gain
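The draft-then-verify loop can be sketched with toy stand-ins. This is the greedy case only; `draft` and `target` are hypothetical callables that return the next token id for a prefix, and a real system verifies all k draft tokens in a single batched forward pass of the target rather than one call per token:

```python
# One step of greedy speculative decoding (illustrative sketch).
def speculative_step(prefix, draft, target, k=4):
    # 1) The cheap draft model proposes k tokens autoregressively.
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)
    # 2) The target verifies: keep the longest prefix where it agrees.
    accepted, ctx = [], list(prefix)
    for t in proposed:
        if target(ctx) != t:
            break
        accepted.append(t)
        ctx.append(t)
    # 3) The target always supplies one token itself (its own prediction at
    # the first disagreement), so every step makes progress even when the
    # draft is always wrong — output is identical to plain greedy decoding.
    accepted.append(target(ctx))
    return accepted

oracle = lambda ctx: len(ctx)          # toy "model": next token = prefix length
print(speculative_step([0], oracle, oracle, k=4))  # [1, 2, 3, 4, 5]
```

With a perfect draft, each step emits k+1 tokens for one (batched) target pass; with a useless draft, it degrades to one token per pass — which is why the acceptance rate decides whether the technique pays for itself.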
Decision framework
```
What's your main problem?
├── GPU cost too high
│   └── Start with AWQ 4-bit quantization + continuous batching (vLLM)
├── High per-request latency
│   └── KV cache optimization (PagedAttention) + speculative decoding
├── Can't serve enough users
│   └── Continuous batching + quantization to free up memory
└── Quality degraded with quantization
    └── AWQ instead of GPTQ, or stay at 8 bits for sensitive tasks
```

Most teams solve 60-80% of the problem with quantization + continuous batching. The remaining techniques (speculative decoding, KV cache compression) come in when there's a specific need or when scale demands it.
Next steps
- Measure before optimizing: baseline of P50/P99 latency, throughput (tokens/s), and cost per 1M tokens with current configuration
- Implement quantization first: the best cost-benefit change with lowest risk
- Add continuous batching: if using vLLM or TGI, it's already available — just configure it
- Evaluate speculative decoding only if you need latency reduction with preserved quality
Need to optimize LLM inference in production? Talk to Imperialis about inference optimization and reduce cost and latency based on data.