How to Build a Predictive Kubernetes Autoscaler with Prometheus and Machine Learning

Reactive autoscaling is one of the superpowers of Kubernetes. The Horizontal Pod Autoscaler (HPA) can automatically scale your application based on observed metrics like CPU utilization or request counts, ensuring you have just enough resources to handle the current load. But what about the load you know is coming?

For applications with predictable traffic patterns—like daily peaks, weekly cycles, or anticipated flash sales—reactive scaling can fall short. By the time the HPA detects a spike and starts new pods, your users might already be experiencing slowdowns or errors.

This is where predictive autoscaling comes in. By leveraging historical data from a monitoring tool like Prometheus and applying a bit of machine learning, we can forecast future demand and scale our applications before the traffic hits. In this post, we’ll walk through the architecture and a proof-of-concept implementation of a predictive autoscaler that does just that.

Why Reactive Autoscaling Isn’t Always Enough

To understand the need for prediction, we first need to appreciate the mechanics and limitations of reactive scaling.

Understanding the Horizontal Pod Autoscaler (HPA)

The standard Kubernetes Horizontal Pod Autoscaler is a control loop that periodically checks metrics from the Metrics Server or a custom metrics API. You define a target metric (e.g., 70% CPU utilization), and the HPA adjusts the number of replicas in a Deployment or ReplicaSet to meet that target.

For example, if the average CPU utilization across your pods jumps to 90% and your target is 70%, the HPA computes desiredReplicas = ceil(currentReplicas × 90 / 70) and raises the replica count accordingly.
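
For reference, here is a minimal sketch of such a reactive HPA manifest targeting 70% CPU utilization (the deployment name and replica bounds are illustrative):

# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70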

The Lag Problem: When Spikes Outpace Scaling

The HPA is powerful, but it’s inherently reactive. There’s a delay between the moment traffic increases and the moment your application is fully scaled to handle it. This lag is composed of several steps:

  1. Metric Collection: Prometheus or an agent scrapes the metrics from your pods.
  2. Aggregation: The Metrics Server or adapter aggregates these metrics.
  3. HPA Evaluation: The HPA controller fetches the metric and compares it to the target.
  4. Scaling Decision: The HPA updates the replica count on the parent resource.
  5. Pod Startup: The Kubernetes scheduler assigns the new pods to nodes, the container runtime pulls the image, and the application starts.

This entire process can take minutes. For a sudden, massive spike in traffic, those minutes can be the difference between a smooth user experience and a cascade of 503 Service Unavailable errors.

Architecting a Predictive Autoscaler

A predictive autoscaler flips the model: instead of reacting to what’s happening now, it acts based on what it expects to happen next. Our architecture will consist of three main components.

(Conceptual diagram: Prometheus feeds historical metrics to the ML model, which produces forecasts that the custom controller turns into replica counts on the Deployment.)

Core Components

  1. Prometheus: Our time-series database and data source. It continuously scrapes and stores application metrics, such as http_requests_total. We’ll use this historical data to train our model.
  2. Machine Learning Model: The forecasting engine. This can be a Python script that uses a time-series forecasting library like Facebook’s Prophet or a classic statistical model like ARIMA. It will query Prometheus, train on the historical data, and predict future request volumes.
  3. Custom Controller: The actuator. This component takes the forecast from the ML model, calculates the required number of replicas, and uses the Kubernetes API to update the deployment.

The Workflow

The end-to-end process looks like this:

  1. Collect: Prometheus scrapes a request counter metric from our application.
  2. Train: A scheduled job (e.g., a Kubernetes CronJob) periodically runs a Python script to query the last several days of metrics from Prometheus and retrain a time-series model.
  3. Predict: The custom controller, running as a Deployment in the cluster, loads the latest model and makes a forecast for the near future (e.g., the next 30 minutes).
  4. Calculate: The controller translates the predicted request volume into a target replica count. For instance: replicas = ceil(predicted_requests_per_second / max_requests_per_pod), so a forecast of 450 rps with pods that each handle 100 rps yields ceil(4.5) = 5 replicas.
  5. Act: The controller patches the target Kubernetes Deployment with the new replica count.
  6. Repeat: The controller sleeps for a set interval and repeats the cycle.

Building a Proof-of-Concept

Let’s build a simplified version of this system. We’ll use Python for our controller and a simple forecasting model.

Step 1: Instrument Your Application and Prometheus

First, ensure your application exposes a Prometheus metric. A Counter for total HTTP requests is perfect. In a Python Flask app with the prometheus-flask-exporter library, this is automatic.
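
Instrumenting a Flask app takes just a couple of lines, sketched below. One caveat: the exporter's default request counter is named flask_http_request_total, so substitute your actual metric name into the PromQL queries in this post if it differs from http_requests_total.

# app.py -- minimal Flask app using prometheus-flask-exporter
from flask import Flask
from prometheus_flask_exporter import PrometheusMetrics

app = Flask(__name__)
metrics = PrometheusMetrics(app)  # registers default request metrics and a /metrics endpoint

@app.route('/')
def index():
    return 'hello'

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8000)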

Next, configure your Prometheus instance to scrape this metric. Here’s a sample scrape config:

# prometheus.yml
scrape_configs:
  - job_name: 'my-app'
    static_configs:
      - targets: ['my-app-service.default.svc.cluster.local:8000']

We’ll use a PromQL query to get the rate of requests, which will be the input for our model:

rate(http_requests_total{job="my-app"}[5m])

Step 2: The Prediction Service (ML Model)

For our PoC, we’ll use statsmodels to create a simple ARIMA model. This script will fetch data from Prometheus, train a model, and save it to a file.

# train_model.py
import requests
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
import pickle
import time

PROMETHEUS_URL = 'http://prometheus-k8s.monitoring.svc.cluster.local:9090'
QUERY = 'rate(http_requests_total{job="my-app"}[5m])'
MODEL_PATH = 'arima_model.pkl'

def fetch_prometheus_data(query: str, duration: str = '3d') -> pd.DataFrame:
    """Fetches time-series data from Prometheus."""
    end_time = int(time.time())
    start_time = end_time - pd.to_timedelta(duration).total_seconds()
    
    response = requests.get(
        f'{PROMETHEUS_URL}/api/v1/query_range',
        params={
            'query': query,
            'start': start_time,
            'end': end_time,
            'step': '1m' # 1-minute intervals
        }
    )
    response.raise_for_status()
    results = response.json()['data']['result']
    
    if not results:
        raise ValueError("No data returned from Prometheus")

    # Convert to pandas DataFrame
    data = results[0]['values']
    df = pd.DataFrame(data, columns=['timestamp', 'value'])
    df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s')
    df.set_index('timestamp', inplace=True)
    df['value'] = df['value'].astype(float)
    return df

def train_and_save_model(data: pd.DataFrame):
    """Trains an ARIMA model and saves it."""
    # Regularize the index to 1-minute frequency (matching the query step) and
    # interpolate any scrape gaps so statsmodels sees an evenly spaced series
    series = data['value'].asfreq('min').interpolate()
    
    # Fit the model; d=1 in the (p,d,q) order handles differencing internally,
    # so we don't also difference the series manually
    model = ARIMA(series, order=(5, 1, 0))
    model_fit = model.fit()
    
    # Save the model
    with open(MODEL_PATH, 'wb') as pkl:
        pickle.dump(model_fit, pkl)
    print(f"Model trained and saved to {MODEL_PATH}")

if __name__ == "__main__":
    print("Fetching data from Prometheus...")
    historical_data = fetch_prometheus_data(QUERY, duration='3d')
    print(f"Fetched {len(historical_data)} data points.")
    
    print("Training model...")
    train_and_save_model(historical_data)

You would run this script periodically using a Kubernetes CronJob.
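
A sketch of that CronJob is below; the trainer image name is an assumption, and in a real setup the saved model file would need to land on shared storage (a PersistentVolume or an object store) for the controller to pick it up, since a container's local filesystem won't outlive the job.

# train-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: model-trainer
  namespace: default
spec:
  schedule: "0 */6 * * *"  # retrain every 6 hours
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: trainer
            image: your-repo/model-trainer:latest  # assumed image containing train_model.py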

Step 3: Building the Custom Controller

Our controller will be a Python script that loads the model, makes predictions, and patches the target deployment. We’ll use the kubernetes Python client.

First, let’s define the RBAC permissions our controller needs to patch deployments.

# rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: predictive-autoscaler-sa
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: predictive-autoscaler-role
  namespace: default
rules:
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "patch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: predictive-autoscaler-rb
  namespace: default
subjects:
- kind: ServiceAccount
  name: predictive-autoscaler-sa
  namespace: default
roleRef:
  kind: Role
  name: predictive-autoscaler-role
  apiGroup: rbac.authorization.k8s.io

Now, the controller script itself:

# controller.py
import os
import time
import pickle
import math
from kubernetes import client, config

# --- Configuration ---
MODEL_PATH = 'arima_model.pkl'
TARGET_DEPLOYMENT = 'my-app'
TARGET_NAMESPACE = 'default'
MAX_REQ_PER_POD = 100  # Max requests per second a single pod can handle
MIN_REPLICAS = 2
MAX_REPLICAS = 20
PREDICTION_HORIZON_MINS = 30 # Predict 30 minutes ahead
LOOP_INTERVAL_SECS = 60 * 5 # Run every 5 minutes

def load_k8s_config():
    """Loads Kubernetes configuration, in-cluster or from kubeconfig."""
    if "KUBERNETES_PORT" in os.environ:
        config.load_incluster_config()
    else:
        config.load_kube_config()

def predict_future_load(model_fit) -> float:
    """Makes a forecast using the loaded model."""
    # Forecast the next 30 one-minute steps (matching the 1m query step).
    # NOTE: a statically loaded model always forecasts from the end of its
    # training window, so successive cycles return the same values until the
    # model is retrained (see the MLOps section).
    forecast = model_fit.forecast(steps=PREDICTION_HORIZON_MINS)
    
    # For simplicity, we'll take the max predicted value in the horizon
    max_predicted_load = forecast.max()
    print(f"Max predicted load in next {PREDICTION_HORIZON_MINS} mins: {max_predicted_load:.2f} rps")
    return max_predicted_load

def calculate_replicas(predicted_load: float) -> int:
    """Calculates the required replica count from the predicted load."""
    if predicted_load <= 0:
        return MIN_REPLICAS
    
    required_replicas = math.ceil(predicted_load / MAX_REQ_PER_POD)
    
    # Clamp the value between min and max replicas
    clamped_replicas = max(MIN_REPLICAS, min(required_replicas, MAX_REPLICAS))
    print(f"Calculated required replicas: {required_replicas}, Clamped to: {clamped_replicas}")
    return clamped_replicas

def update_deployment_replicas(apps_v1_api, replica_count: int):
    """Patches the target deployment with the new replica count."""
    patch_body = {
        "spec": {
            "replicas": replica_count
        }
    }
    try:
        apps_v1_api.patch_namespaced_deployment(
            name=TARGET_DEPLOYMENT,
            namespace=TARGET_NAMESPACE,
            body=patch_body
        )
        print(f"Successfully scaled deployment '{TARGET_DEPLOYMENT}' to {replica_count} replicas.")
    except client.ApiException as e:
        print(f"Error patching deployment: {e}")

if __name__ == "__main__":
    print("Loading Kubernetes configuration...")
    load_k8s_config()
    apps_v1 = client.AppsV1Api()

    print("Loading ML model...")
    try:
        with open(MODEL_PATH, 'rb') as pkl:
            model = pickle.load(pkl)
    except FileNotFoundError:
        print(f"Error: Model file not found at {MODEL_PATH}. Ensure the training job has run.")
        exit(1)

    while True:
        print("\n--- Starting new prediction cycle ---")
        predicted_load = predict_future_load(model)
        new_replica_count = calculate_replicas(predicted_load)
        update_deployment_replicas(apps_v1, new_replica_count)
        
        print(f"Sleeping for {LOOP_INTERVAL_SECS} seconds...")
        time.sleep(LOOP_INTERVAL_SECS)

Step 4: Deploying the Solution

We need to package our controller script and the trained model into a Docker image.

# Dockerfile
FROM python:3.9-slim

WORKDIR /app

RUN pip install kubernetes requests pandas statsmodels

COPY controller.py .
# Copy the trained model file into the image
COPY arima_model.pkl .

CMD ["python", "controller.py"]

Finally, deploy the controller to your cluster.

# controller-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: predictive-autoscaler
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: predictive-autoscaler
  template:
    metadata:
      labels:
        app: predictive-autoscaler
    spec:
      serviceAccountName: predictive-autoscaler-sa
      containers:
      - name: controller
        image: your-repo/predictive-autoscaler:latest # Replace with your image
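
With the manifests in place, apply them and tail the controller's logs to watch the scaling loop run:

kubectl apply -f rbac.yaml
kubectl apply -f controller-deployment.yaml
kubectl logs -f deployment/predictive-autoscaler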

From Proof-of-Concept to Production

This PoC demonstrates the core logic, but a production-grade system requires more robustness.

MLOps and Model Management

A single arima_model.pkl file in a container isn’t a scalable solution. In a real-world scenario, you’d implement an MLOps pipeline:

  • Automated Retraining: Use a tool like Kubeflow Pipelines or GitHub Actions to run the training script on a schedule.
  • Model Registry: Store and version your trained models in a registry like MLflow or an AWS S3 bucket. The controller would fetch the latest “production” model from there (see the sketch after this list).
  • Model Monitoring: Track your model’s prediction accuracy. If the model’s performance degrades (a concept known as “model drift”), trigger an alert or an automatic retrain.
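
A minimal sketch of that fetch step, assuming an S3 bucket; the bucket and key names are hypothetical:

# fetch_model.py -- illustrative; bucket and key names are assumptions
import pickle
import boto3

def load_latest_model(bucket: str = 'ml-models', key: str = 'autoscaler/arima_model.pkl'):
    """Download and deserialize the current production model from S3."""
    s3 = boto3.client('s3')
    obj = s3.get_object(Bucket=bucket, Key=key)
    return pickle.loads(obj['Body'].read())

The controller would call this at the start of each prediction cycle instead of reading a file baked into its image.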

Advanced Scaling Logic

Our controller logic is simple. A production system should be smarter:

  • Fallback Mechanism: What happens if the model fails or Prometheus is down? The controller should have a fallback, such as doing nothing and letting a traditional HPA take over.
  • Combining Predictive and Reactive: You can run both a predictive and a reactive scaler. The autoscaler can be set to whichever of the two demands a higher replica count. This gives you the best of both worlds: proactive scaling for known patterns and reactive scaling for unexpected surges. Frameworks like KEDA are excellent for this, as they allow multiple scalers for a single workload.
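
In code, reconciling the two signals can be as simple as taking the maximum; a sketch, assuming both replica counts are computed elsewhere:

def reconcile_replicas(predictive_replicas: int, reactive_replicas: int) -> int:
    """Scale to whichever signal demands more capacity."""
    return max(predictive_replicas, reactive_replicas)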

Security and Reliability

  • Least Privilege: The Role we created is narrowly scoped, but always double-check that your controller’s ServiceAccount has only the permissions it absolutely needs.
  • High Availability: Run the controller deployment with at least two replicas to avoid a single point of failure. You’ll need a leader election mechanism to ensure only one controller instance is active at a time.

Conclusion

While the Kubernetes HPA is a fantastic tool for reactive scaling, it can’t see what’s coming. By building a predictive autoscaler, you can anticipate load changes and ensure your application is always one step ahead of demand.

We’ve shown that by combining the rich historical data in Prometheus with the forecasting power of time-series ML models, you can create a custom controller that makes intelligent, proactive scaling decisions. This proof-of-concept is a starting point, but it illustrates a powerful pattern for building more resilient and cost-efficient systems on Kubernetes.

What are your thoughts on predictive autoscaling? Have you implemented a similar system? Share your experiences and ideas in the comments below.