Kubernetes Cost Optimization: practical techniques for reducing production costs

3/12/2026 · 8 min read · Cloud

Executive summary

Kubernetes cost optimization involves resource right-sizing, intelligent autoscaling, strategic node pools, and continuous infrastructure efficiency monitoring.

Last updated: 3/12/2026

The problem of unoptimized costs

Kubernetes offers unprecedented orchestration power, but this flexibility comes with a risk: hidden costs that grow exponentially when resources are not managed proactively. In enterprise clusters, it's common to find up to 40% computational waste due to over-provisioning, underutilized pods, and unbalanced node pools.

The challenge isn't simply "reduce costs" — it's reducing costs without sacrificing availability, performance, or scaling capacity. Effective optimization requires a systematic approach combining resource right-sizing, intelligent autoscaling, and continuous monitoring.

Right-sizing: the foundation of optimization

Request vs Limit: the over-provisioning trap

Correct configuration of requests and limits is the first step to optimization, but it's also where most errors occur.

```yaml
# WRONG: typical over-provisioning
apiVersion: v1
kind: Pod
metadata:
  name: app-server
spec:
  containers:
  - name: server
    image: myapp:latest
    resources:
      requests:
        cpu: "2000m"      # 2 cores for a workload using 200m
        memory: "4Gi"     # 4GB for a workload using 512MB
      limits:
        cpu: "4000m"
        memory: "8Gi"
```

```yaml
# CORRECT: right-sizing based on real metrics
apiVersion: v1
kind: Pod
metadata:
  name: app-server
spec:
  containers:
  - name: server
    image: myapp:latest
    resources:
      requests:
        cpu: "250m"       # 25% margin over baseline
        memory: "650Mi"   # 25% margin + overhead
      limits:
        cpu: "500m"       # 2x request for burst allowance
        memory: "1Gi"     # 1.5x request for peaks
```

Right-sizing framework

The right-sizing process should follow a systematic approach:

```bash
# 1. Collect baseline metrics per workload (NR>1 skips the header row)
kubectl top pods -n production -l app=api-server | \
  awk 'NR>1 {sum+=$2; count++} END {print "Avg CPU:", sum/count "m"}'

kubectl top pods -n production -l app=api-server | \
  awk 'NR>1 {sum+=$3; count++} END {print "Avg Memory:", sum/count "Mi"}'

# 2. Identify temporal usage patterns
kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/production/pods | \
  jq '.items[] | select(.metadata.labels.app=="api-server") | .containers[] |
  {name: .name, cpu: .usage.cpu, memory: .usage.memory}'

# 3. Calculate P95 with tools like Prometheus
# Query example: quantile_over_time(0.95, rate(container_cpu_usage_seconds_total[5m])[24h:])
```

Practical right-sizing rule:

  • CPU Request = P70 usage + 25% buffer
  • CPU Limit = Request × 2 (burst allowance)
  • Memory Request = P90 usage + 30% buffer
  • Memory Limit = Request × 1.5 (prevent OOM)
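The rule above is mechanical enough to automate. A minimal sketch (function and parameter names are illustrative, not from any tool) that turns observed usage percentiles into request/limit pairs:

```python
def right_size(cpu_p70_m: float, mem_p90_mi: float) -> dict:
    """Apply the practical right-sizing rule to observed usage percentiles.

    cpu_p70_m  -- P70 CPU usage in millicores
    mem_p90_mi -- P90 memory usage in MiB
    """
    cpu_request = cpu_p70_m * 1.25   # P70 + 25% buffer
    mem_request = mem_p90_mi * 1.30  # P90 + 30% buffer
    return {
        "cpu_request_m": round(cpu_request),
        "cpu_limit_m": round(cpu_request * 2),        # 2x request burst allowance
        "memory_request_mi": round(mem_request),
        "memory_limit_mi": round(mem_request * 1.5),  # 1.5x request OOM headroom
    }

# A workload observed at ~200m CPU (P70) and ~512Mi memory (P90):
print(right_size(200, 512))
```

Feeding in the baseline from the earlier example yields values close to the "CORRECT" manifest above (250m/500m CPU, ~650Mi/1Gi memory).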

Intelligent autoscaling: HPA and VPA

Horizontal Pod Autoscaler (HPA)

HPA scales horizontally based on utilization metrics or custom metrics.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 20
  metrics:
  # CPU metric (default)
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  # Memory metric
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  # Custom metric (e.g. requests per second)
  - type: Pods
    pods:
      metric:
        name: requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
      - type: Pods
        value: 4
        periodSeconds: 60
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 300
      selectPolicy: Min
```
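Under the hood, the HPA control loop computes desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), then clamps to the min/max bounds before applying the behavior policies. A minimal sketch of that core formula:

```python
import math

def hpa_desired_replicas(current_replicas: int,
                         current_metric: float,
                         target_metric: float,
                         min_replicas: int,
                         max_replicas: int) -> int:
    """Core HPA scaling formula (before behavior policies and tolerance)."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp to the configured replica bounds
    return max(min_replicas, min(max_replicas, desired))

# 5 replicas at 90% average CPU against a 70% target -> scale out
print(hpa_desired_replicas(5, 90, 70, min_replicas=3, max_replicas=20))  # 7
```

This is why a too-low utilization target inflates replica counts (and cost): halving the target roughly doubles the computed replicas for the same load.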

Vertical Pod Autoscaler (VPA)

VPA automatically adjusts requests and limits based on usage history.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Auto"  # Off, Initial, Recreate, Auto
  resourcePolicy:
    containerPolicies:
    - containerName: '*'
      minAllowed:
        cpu: "100m"
        memory: "256Mi"
      maxAllowed:
        cpu: "2000m"
        memory: "4Gi"
      controlledResources: ["cpu", "memory"]
      controlledValues: RequestsAndLimits
```

HPA + VPA: when and how to combine

| Configuration | When to use | Trade-offs |
| --- | --- | --- |
| HPA only | Horizontal workloads (stateless) | Waste if pods are under-provisioned |
| VPA only | Workloads with a stable profile | Doesn't scale horizontally |
| HPA + VPA | Workloads with variable patterns | Higher complexity; VPA can conflict with HPA |
```yaml
# HPA + VPA combination with fine control
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off"  # VPA only recommends; HPA decides scaling
  recommenders:
  - name: k8s.io/vpa-recommender
  - name: k8s.io/vpa-empty-recommender
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  metrics:
  - type: Pods
    pods:
      metric:
        name: requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"
```

Node Pools and Node Autoscaling

Cluster Autoscaler

Cluster Autoscaler adjusts the number of nodes based on pending pods.

```bash
# Configure Cluster Autoscaler (cloud-provider specific)
# AWS EKS example:
eksctl utils associate-iam-oidc-provider \
  --region us-east-1 \
  --cluster production-cluster

# NOTE: the policy ARN below is illustrative; in practice Cluster Autoscaler
# uses a customer-managed policy granting the autoscaling/ec2 Describe and
# SetDesiredCapacity actions.
eksctl create iamserviceaccount \
  --cluster production-cluster \
  --namespace kube-system \
  --name cluster-autoscaler \
  --attach-policy-arn arn:aws:iam::aws:policy/AmazonEKSClusterAutoscalerPolicy \
  --approve \
  --override-existing-serviceaccounts

# Deploy with optimized configuration
kubectl apply -f - <<EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cluster-autoscaler
  namespace: kube-system
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
      # registry.k8s.io replaced the deprecated k8s.gcr.io registry
      - image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.1
        name: cluster-autoscaler
        command:
        - ./cluster-autoscaler
        - --v=4
        - --stderrthreshold=info
        - --cloud-provider=aws
        - --skip-nodes-with-system-pods=false
        - --balance-similar-node-groups=true
        - --expander=least-waste
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/production-cluster
        env:
        - name: AWS_REGION
          value: us-east-1
        resources:
          limits:
            cpu: "100m"
            memory: "300Mi"
          requests:
            cpu: "100m"
            memory: "300Mi"
EOF
```

Stratified Node Pools

A workload-based node-pool strategy enables granular cost optimization: critical services on on-demand capacity, interruptible work on spot, and memory-bound services on cheaper Graviton (Arm) instances.

```bash
# Optimized AWS EKS node groups
# 1. On-demand pool for critical workloads
eksctl create nodegroup \
  --cluster production-cluster \
  --region us-east-1 \
  --name critical-workloads \
  --node-type c6i.xlarge \
  --nodes 3 \
  --nodes-min 2 \
  --nodes-max 5 \
  --managed \
  --asg-access \
  --external-dns-access

# 2. Spot pool for interruptible workloads
# (for managed spot node groups, the max spot price is capped at the
# on-demand price by default; instance types are picked by the selector)
eksctl create nodegroup \
  --cluster production-cluster \
  --region us-east-1 \
  --name spot-workers \
  --nodes 10 \
  --nodes-min 5 \
  --nodes-max 20 \
  --managed \
  --spot \
  --instance-selector-vcpus 4 \
  --instance-selector-memory 8

# 3. Graviton (Arm) pool for memory-intensive workloads
eksctl create nodegroup \
  --cluster production-cluster \
  --region us-east-1 \
  --name memory-workers \
  --node-type r6g.xlarge \
  --nodes 2 \
  --nodes-min 1 \
  --nodes-max 4 \
  --managed
```
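To size the payoff of the spot pool before committing, a back-of-the-envelope blended-cost estimate is enough. The prices below are hypothetical placeholders, not current AWS rates; look up real ones for your region and instance type:

```python
# Hypothetical hourly prices ($/h, illustrative only -- not real AWS rates)
ON_DEMAND_PRICE = 0.34   # e.g. a c6i.2xlarge on-demand
SPOT_PRICE = 0.12        # the same instance type on spot

def monthly_savings(nodes: int, hours: float = 730.0) -> float:
    """Estimated monthly saving from moving `nodes` workers to spot capacity."""
    return nodes * hours * (ON_DEMAND_PRICE - SPOT_PRICE)

print(f"10 spot workers save ~${monthly_savings(10):,.0f}/month")
```

Even at conservative spot discounts, a 10-node interruptible pool typically pays for the engineering effort of adding tolerations and PDBs within the first month.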
```yaml
# Taints and tolerations for specific workloads
apiVersion: v1
kind: Pod
metadata:
  name: batch-processor
  namespace: batch-jobs
spec:
  tolerations:
  - key: "workload-type"
    operator: "Equal"
    value: "spot"
    effect: "NoSchedule"
  containers:
  - name: processor
    image: batch-processor:latest
    resources:
      requests:
        cpu: "1000m"
        memory: "4Gi"
---
apiVersion: v1
kind: Pod
metadata:
  name: api-server
  namespace: production
spec:
  nodeSelector:
    workload-type: "critical"
  containers:
  - name: server
    image: api-server:latest
```

Pod Disruption Budgets and efficiency

PDBs ensure availability during upgrades and scaling without over-provisioning.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
  namespace: production
spec:
  minAvailable: 2  # Minimum available pods
  selector:
    matchLabels:
      app: api-server
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: batch-worker-pdb
  namespace: batch-jobs
spec:
  maxUnavailable: 2  # Maximum unavailable pods
  selector:
    matchLabels:
      app: batch-worker
```

Optimized PDB strategy:

  • Critical: minAvailable: N-1 (where N is replica count)
  • Batch: maxUnavailable: 50%
  • Stateful: minAvailable: quorum (N/2 + 1)
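The strategy above is mechanical enough to compute from the replica count; a sketch (the helper is illustrative, not a real API) mapping workload type to a PDB value, with quorum as floor(N/2) + 1:

```python
def pdb_for(workload_type: str, replicas: int) -> dict:
    """Suggest a PDB field per the strategy above (illustrative helper)."""
    if workload_type == "critical":
        return {"minAvailable": replicas - 1}        # tolerate one disruption
    if workload_type == "batch":
        return {"maxUnavailable": "50%"}             # half may be evicted at once
    if workload_type == "stateful":
        return {"minAvailable": replicas // 2 + 1}   # preserve quorum
    raise ValueError(f"unknown workload type: {workload_type}")

# Quorum for a 5-replica stateful set:
print(pdb_for("stateful", 5))  # {'minAvailable': 3}
```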

Cost efficiency monitoring

Key FinOps metrics

```promql
# 1. CPU utilization by namespace
sum(rate(container_cpu_usage_seconds_total{namespace="production"}[5m])) by (pod) /
sum(kube_pod_container_resource_requests{resource="cpu", namespace="production"}) by (pod)

# 2. Memory waste (requested vs used)
(sum(kube_pod_container_resource_requests{resource="memory", namespace="production"}) by (pod) -
 sum(container_memory_working_set_bytes{namespace="production"}) by (pod)) /
sum(kube_pod_container_resource_requests{resource="memory", namespace="production"}) by (pod)

# 3. Node efficiency (utilization vs capacity)
sum(rate(container_cpu_usage_seconds_total{node!=""}[5m])) by (node) /
sum(kube_node_status_capacity{resource="cpu"}) by (node)

# 4. Cost per request (custom business metric)
rate(cost_per_request_total[5m])
```
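The same ratios can be computed offline from sampled values, e.g. to flag right-sizing candidates in a spreadsheet export; a minimal sketch mirroring queries 1 and 2:

```python
def cpu_efficiency(used_cores: float, requested_cores: float) -> float:
    """Fraction of requested CPU actually used (query 1 above)."""
    return used_cores / requested_cores

def memory_waste(requested_bytes: float, working_set_bytes: float) -> float:
    """Fraction of requested memory sitting idle (query 2 above)."""
    return (requested_bytes - working_set_bytes) / requested_bytes

# A pod requesting 2 cores but using 0.2, and 4GiB while touching 512MiB:
print(f"CPU efficiency: {cpu_efficiency(0.2, 2.0):.0%}")              # 10%
print(f"Memory waste:   {memory_waste(4 * 2**30, 512 * 2**20):.0%}")  # 88%
```

Both numbers would trip the alert thresholds defined in the next section.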

Dashboards and alerts

```yaml
# PrometheusRule for efficiency alerts
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cost-efficiency-alerts
  namespace: monitoring
spec:
  groups:
  - name: cost-optimization
    rules:
    - alert: LowCPUEfficiency
      expr: |
        sum(rate(container_cpu_usage_seconds_total{namespace="production"}[1h])) by (pod) /
        sum(kube_pod_container_resource_requests{resource="cpu", namespace="production"}) by (pod) < 0.2
      for: 2h
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.pod }} has low CPU efficiency (<20%)"
        description: "Consider right-sizing CPU requests"

    - alert: HighMemoryWaste
      expr: |
        (sum(kube_pod_container_resource_requests{resource="memory", namespace="production"}) by (pod) -
         sum(container_memory_working_set_bytes{namespace="production"}) by (pod)) /
        sum(kube_pod_container_resource_requests{resource="memory", namespace="production"}) by (pod) > 0.6
      for: 2h
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.pod }} has high memory waste (>60%)"

    - alert: OverProvisionedReplicas
      expr: |
        sum(rate(container_cpu_usage_seconds_total[5m])) by (deployment) /
        (sum(kube_pod_container_resource_requests{resource="cpu"}) by (deployment) * kube_deployment_spec_replicas) < 0.3
      for: 4h
      labels:
        severity: info
      annotations:
        summary: "Deployment {{ $labels.deployment }} may be over-provisioned"
```

Workload-specific strategies

Stateless workloads (API servers, web apps)

```yaml
# Optimizations for stateless workloads
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3  # Minimum baseline
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - api-server
              topologyKey: kubernetes.io/hostname
      containers:
      - name: server
        image: api-server:latest
        resources:
          requests:
            cpu: "250m"
            memory: "512Mi"
          limits:
            cpu: "500m"
            memory: "1Gi"
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 15"]
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
      terminationGracePeriodSeconds: 30
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
spec:
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```

Batch workloads (jobs, ETL)

```yaml
# Optimizations for batch workloads
apiVersion: batch/v1
kind: Job
metadata:
  name: data-processor
  namespace: batch-jobs
spec:
  parallelism: 10
  completions: 10
  backoffLimit: 3
  template:
    spec:
      tolerations:
      - key: "workload-type"
        operator: "Equal"
        value: "spot"
        effect: "NoSchedule"
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: instance-type
                operator: In
                values:
                - spot
      containers:
      - name: processor
        image: batch-processor:latest
        resources:
          requests:
            cpu: "2000m"
            memory: "4Gi"
          limits:
            cpu: "4000m"
            memory: "8Gi"
        env:
        - name: PARALLELISM
          value: "10"
      restartPolicy: OnFailure
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: batch-worker-pdb
spec:
  maxUnavailable: 50%
  selector:
    matchLabels:
      job-name: data-processor
```

Cost optimization tools

OpenCost

```bash
# Install OpenCost for cost tracking
kubectl apply -f https://raw.githubusercontent.com/opencost/opencost/develop/kubernetes/opencost.yaml

# Configure for a specific cloud provider
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: opencost
  namespace: opencost
data:
  opencost.yaml: |
    clusterName: production-cluster
    prometheus:
      internal:
        enabled: true
    cloudProvider: aws
    currencyCode: USD
EOF
```

Kubecost

```bash
# Install Kubecost
kubectl create namespace kubecost
helm repo add kubecost https://kubecost.github.io/cost-analyzer/
helm install kubecost kubecost/cost-analyzer \
  --namespace kubecost \
  --set kubecostToken="<YOUR_KUBECOST_TOKEN>" \
  --set serviceAccount.create=true \
  --set serviceAccount.name=kubecost-service-account \
  --set global.prometheus.fqdn=http://prometheus-server.monitoring.svc.cluster.local \
  --set global.prometheus.enabled=false
```

Continuous optimization framework

Monthly efficiency checklist

  1. Resource audit:
  • Identify pods with CPU utilization < 20% for > 24h
  • Identify pods with memory waste > 50% for > 24h
  • Check workloads without VPA configured
  2. Autoscaling audit:
  • Verify HPA configurations for stateless workloads
  • Confirm Cluster Autoscaler is active and functional
  • Validate node pools have adequate limits
  3. Node pool audit:
  • Analyze node utilization by pool
  • Identify underutilized pools (merge opportunity)
  • Validate spot instances are being used where appropriate
  4. Cost analysis:
  • Compare month-over-month cost by namespace
  • Identify cost anomalies
  • Correlate cost with business metrics
```bash
#!/bin/bash
# Automated audit script
NAMESPACE=${1:-production}

echo "=== Kubernetes Cost Optimization Audit ==="
echo "Namespace: $NAMESPACE"
echo ""

# 1. CPU requests per pod (compare against kubectl top to spot <20% efficiency)
echo "1. CPU requests per pod (compare against usage for low efficiency):"
kubectl get pods -n "$NAMESPACE" -o json | \
  jq -r '.items[] | select(.spec.containers[].resources.requests.cpu != null) |
  "\(.metadata.name): \(.spec.containers[].resources.requests.cpu // "N/A")"'

# 2. Current memory usage per pod (compare against requests to spot >50% waste)
echo ""
echo "2. Memory usage per pod:"
kubectl top pods -n "$NAMESPACE" | \
  awk 'NR>1 && $3 ~ /Mi/ {print $1, $3}'

# 3. Deployments without an HPA
echo ""
echo "3. Deployments without HPA:"
kubectl get hpa -n "$NAMESPACE" -o json | \
  jq -r '.items[].spec.scaleTargetRef.name' | \
  sort -u > /tmp/hpa_deployments.txt
kubectl get deploy -n "$NAMESPACE" -o json | \
  jq -r '.items[].metadata.name' | \
  sort -u > /tmp/all_deployments.txt
comm -13 /tmp/hpa_deployments.txt /tmp/all_deployments.txt

# 4. Node utilization
echo ""
echo "4. Node utilization:"
kubectl top nodes
```

Conclusion

Kubernetes cost optimization is not a one-time activity — it's a continuous discipline combining resource right-sizing, intelligent autoscaling, strategic node pools, and efficiency monitoring. Organizations that treat cost optimization as a systematic process, not as a reaction to surprising cloud bills, achieve 30-50% reductions without sacrificing availability or performance.


Is your Kubernetes cluster experiencing runaway costs and you need a proven optimization strategy? Talk to Imperialis cloud specialists to implement a Kubernetes FinOps framework that reduces costs while maintaining availability and performance.
