
Building a Scalable MLOps Pipeline: A Step-by-Step Tutorial Using Kubeflow on Kubernetes

In machine learning, moving a model from a data scientist’s notebook to a production environment is where many projects falter. The gap between development and operations can lead to brittle, unmaintainable, and unscalable ML systems. This is the challenge that MLOps, or Machine Learning Operations, aims to solve.

By applying DevOps principles to the machine learning lifecycle, MLOps enables robust AI automation, reproducibility, and scalability. One of the most powerful, cloud-native toolkits for achieving this is Kubeflow, which runs on top of the container orchestration giant, Kubernetes.

This tutorial will guide you through building a complete, scalable machine learning pipeline on Kubernetes using Kubeflow. You will learn how to containerize each step of your ML workflow, define a reproducible pipeline, and lay the groundwork for a fully automated CI/CD system for ML.

Why Kubeflow on Kubernetes for MLOps?

Before we dive into the tutorial, let’s clarify why this combination is so effective.

  • Kubernetes: As the de facto standard for container orchestration, Kubernetes provides a resilient, scalable, and portable foundation. It manages the underlying compute, storage, and networking resources, allowing your ML workloads to run consistently across any environment—from your local machine to any public cloud.
  • Kubeflow: Kubeflow builds on this foundation, providing a curated set of tools specifically for machine learning. It’s not a single monolithic application but a collection of cloud-native components for each stage of the ML lifecycle, including experimentation, pipelining, training, and serving.

Together, they provide a powerful platform for building a production-grade machine learning pipeline that is both portable and scalable by design.

Prerequisites

To follow this tutorial, you’ll need a basic understanding of ML concepts, Python, and containers, plus access to a Kubernetes cluster. The steps below use git, kubectl, kustomize, and Python 3 with the kfp SDK (version 2.0 or newer).

Step-by-Step Tutorial: Building Your MLOps Pipeline

We will now build a simple ML pipeline that preprocesses data, trains a model, and evaluates it. Each step will be a distinct, containerized component orchestrated by Kubeflow Pipelines.

Step 1: Install Kubeflow on Your Cluster

The recommended way to install Kubeflow is using its official kustomize manifests. This method provides fine-grained control over which components you deploy.

First, clone the Kubeflow manifest repository:

# Clone the Kubeflow manifests repository
git clone https://github.com/kubeflow/manifests.git
cd manifests

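Next, deploy Kubeflow using kustomize and kubectl. The example overlay installs the full Kubeflow platform, including Kubeflow Pipelines and dependencies such as Istio. The loop retries because some resources can only be applied once the custom resource definitions they depend on have been registered.

# Build the example kustomize overlay and apply it, retrying until every resource is accepted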
while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done

Installation can take several minutes. Once complete, you can access the Kubeflow Central Dashboard by port-forwarding the service.

# Port-forward the Istio ingress gateway to access the dashboard
kubectl port-forward -n istio-system svc/istio-ingressgateway 8080:80

Now, navigate to http://localhost:8080 in your browser to see the Kubeflow dashboard.
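If the dashboard does not load, a common cause is that some pods are still starting. The following is a minimal sketch of a readiness check, assuming kubectl is on your PATH and that the components of interest live in the kubeflow namespace. The helper names not_ready_pods and fetch_pods are ours, not part of any Kubeflow tooling.

```python
import json
import subprocess


def not_ready_pods(pod_list: dict) -> list:
    """Return 'namespace/name' for every pod that is neither ready nor completed."""
    pending = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        # Pods created by finished Jobs report phase "Succeeded"; skip them.
        if status.get("phase") == "Succeeded":
            continue
        containers = status.get("containerStatuses", [])
        ready = bool(containers) and all(c.get("ready", False) for c in containers)
        if status.get("phase") != "Running" or not ready:
            meta = pod["metadata"]
            pending.append(f'{meta["namespace"]}/{meta["name"]}')
    return pending


def fetch_pods(namespace: str = "kubeflow") -> dict:
    """Ask kubectl for the pods in a namespace and parse its JSON output."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

Calling not_ready_pods(fetch_pods()) returns an empty list once every pod in the namespace is running and ready. Repeat the call with other namespaces, such as istio-system, if the dashboard is still unreachable.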

Step 2: Designing the Machine Learning Pipeline

Our pipeline will consist of three simple steps:

  1. Prepare Data: Load the Iris dataset and save it as a CSV artifact. The train/test split happens inside the next two steps, which both use the same random seed.
  2. Train Model: Train a simple Scikit-learn classifier on the training data.
  3. Evaluate Model: Evaluate the model’s accuracy on the test data and print the result.

Each of these steps will become a Kubeflow Pipeline “component.” A component is a self-contained function that runs in its own container.

Step 3: Creating Kubeflow Pipeline Components in Python

Kubeflow components are defined as standard Python functions decorated with @dsl.component. The type annotations tell Kubeflow how to handle inputs and outputs, which can be simple values (str, int) or artifacts (Input[Dataset], Output[Model]) that are exposed to the function as file paths.

Let’s create a file named pipeline.py and define our components.

# pipeline.py
from kfp import dsl
from kfp.dsl import Input, Output, Dataset, Model, Metrics

# Note: You'll need to specify a base Docker image.
# For a real project, you would build your own image with necessary libraries.
BASE_IMAGE = 'python:3.9'

@dsl.component(base_image=BASE_IMAGE, packages_to_install=['scikit-learn==1.3.0', 'pandas==2.0.3'])
def prepare_data(
    iris_dataset: Output[Dataset]
):
    """Loads the Iris dataset and saves it for the next component."""
    import pandas as pd
    from sklearn.datasets import load_iris

    iris = load_iris(as_frame=True)
    # Concatenate features and target into a single DataFrame
    data_df = pd.concat([iris.data, iris.target], axis=1)

    # Save the data to the path provided by Kubeflow
    data_df.to_csv(iris_dataset.path, index=False)
    print("Data prepared and saved.")


@dsl.component(base_image=BASE_IMAGE, packages_to_install=['scikit-learn==1.3.0', 'pandas==2.0.3'])
def train_model(
    dataset: Input[Dataset],
    model_artifact: Output[Model]
):
    """Trains a simple Logistic Regression model."""
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    import joblib

    data_df = pd.read_csv(dataset.path)
    X = data_df.drop('target', axis=1)
    y = data_df['target']

    X_train, _, y_train, _ = train_test_split(X, y, test_size=0.3, random_state=42)

    model = LogisticRegression(max_iter=200)
    model.fit(X_train, y_train)

    # Save the trained model to the path provided by Kubeflow
    joblib.dump(model, model_artifact.path)
    print(f"Model trained and saved to {model_artifact.path}")


@dsl.component(base_image=BASE_IMAGE, packages_to_install=['scikit-learn==1.3.0', 'pandas==2.0.3'])
def evaluate_model(
    dataset: Input[Dataset],
    model_artifact: Input[Model],
    metrics: Output[Metrics]
):
    """Evaluates the model and logs metrics."""
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    import joblib

    data_df = pd.read_csv(dataset.path)
    X = data_df.drop('target', axis=1)
    y = data_df['target']

    _, X_test, _, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    model = joblib.load(model_artifact.path)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)

    # Log metrics to Kubeflow UI
    metrics.log_metric("accuracy", round(accuracy, 2))
    print(f"Model accuracy: {accuracy:.2f}")

Important Note: The @dsl.component syntax shown here requires kfp version 2.0 or newer. Kubeflow does not build an image for these lightweight components; instead, each step starts a container from base_image and installs the packages listed in packages_to_install with pip at runtime. In a production scenario, you would build a custom Docker image with all dependencies preinstalled and reference it via the base_image parameter for faster and more reliable execution.
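One detail in the components above is easy to miss: train_model and evaluate_model each recompute the train/test split on their own, so the evaluation is only valid because both pass the same test_size and random_state. The sketch below illustrates that idea using only the standard library. The split_indices helper is a hypothetical stand-in for train_test_split, not scikit-learn’s actual algorithm.

```python
import random


def split_indices(n_rows: int, test_fraction: float, seed: int):
    """Shuffle row indices with a fixed seed and cut them into train/test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


# "train_model" and "evaluate_model" each compute the split independently.
train_a, _ = split_indices(150, 0.3, seed=42)
_, test_b = split_indices(150, 0.3, seed=42)

# Same arguments give the same partition, so no evaluation row was used for training.
print(len(train_a), len(test_b), set(train_a).isdisjoint(test_b))

# With a different seed in one component, the two partitions no longer line up
# and training rows could end up in the evaluation set.
_, test_c = split_indices(150, 0.3, seed=7)
print(set(train_a).isdisjoint(test_c))
```

A sturdier design would split once in prepare_data and emit separate train and test artifacts, so the two downstream steps cannot drift apart.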

Step 4: Assembling the Pipeline with the Python DSL

Now, we connect these components into a Directed Acyclic Graph (DAG) using a function decorated with @dsl.pipeline.

Add the following to your pipeline.py file:

# pipeline.py (continued)

@dsl.pipeline(
    name='iris-classifier-pipeline',
    description='A simple pipeline to train an Iris classifier.'
)
def iris_pipeline():
    # Step 1: Prepare the data
    prepare_data_task = prepare_data()

    # Step 2: Train the model using the output from the prepare_data step
    train_model_task = train_model(
        dataset=prepare_data_task.outputs['iris_dataset']
    )

    # Step 3: Evaluate the model using the outputs from the previous steps
    evaluate_model(
        dataset=prepare_data_task.outputs['iris_dataset'],
        model_artifact=train_model_task.outputs['model_artifact']
    )

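Notice how the outputs of one task are passed as inputs to the next. Kubeflow handles the data passing between these containerized steps under the hood: each artifact is written to the pipeline’s artifact store, typically an S3-compatible object store, and made available to the downstream step at the location exposed by its .path attribute.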
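These data dependencies are also what determine the execution order. As a rough illustration (this uses Python’s standard graphlib module, not Kubeflow’s own scheduler), the same three-step graph can be written as a mapping from each task to the tasks whose outputs it consumes:

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks whose outputs it consumes.
dependencies = {
    "prepare_data": set(),
    "train_model": {"prepare_data"},
    "evaluate_model": {"prepare_data", "train_model"},
}

# A topological sort yields an order in which every task runs after its inputs exist.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
# -> ['prepare_data', 'train_model', 'evaluate_model']
```

Because evaluate_model needs both the dataset and the trained model, this graph has only one valid order. In larger pipelines, tasks with no dependency on each other can run in parallel.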

Step 5: Compiling and Uploading the Pipeline

With the pipeline defined, we need to compile it into a YAML file that Kubeflow can understand. The compiler included in the kfp SDK makes this easy.

First, add the compiler invocation to the end of your pipeline.py file to make it executable.

# pipeline.py (continued)

if __name__ == '__main__':
    from kfp.compiler import Compiler
    Compiler().compile(
        pipeline_func=iris_pipeline,
        package_path='iris_pipeline.yaml'
    )
    print("Pipeline compiled to iris_pipeline.yaml")

Now, run the script to generate the YAML file:

python pipeline.py

You will now have a file named iris_pipeline.yaml. To run it, go to your Kubeflow dashboard (http://localhost:8080), navigate to the “Pipelines” section, and click “Upload pipeline”. Upload the iris_pipeline.yaml file.
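Before uploading, you may want a quick sanity check that the compiled file contains all three components. The sketch below is a plain text search, assuming the compiler emits each component under a key named comp-<name> with underscores turned into hyphens. That naming is an assumption about the compiled format, so verify it against your own iris_pipeline.yaml. The helper names are ours.

```python
from pathlib import Path

EXPECTED_COMPONENTS = ["prepare-data", "train-model", "evaluate-model"]


def missing_components(compiled_yaml: str, expected=EXPECTED_COMPONENTS) -> list:
    """Return the expected component names that do not appear in the compiled spec."""
    # Assumption: each component is emitted under a key named "comp-<name>".
    return [name for name in expected if f"comp-{name}:" not in compiled_yaml]


def check_compiled_pipeline(path: str = "iris_pipeline.yaml") -> None:
    """Exit with an error if any expected component is absent from the file."""
    missing = missing_components(Path(path).read_text())
    if missing:
        raise SystemExit(f"Compiled pipeline is missing components: {missing}")
    print("All expected components found.")
```

Running check_compiled_pipeline() right after python pipeline.py catches the case where a component was renamed or accidentally dropped from the pipeline function.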

Step 6: Running and Monitoring the Pipeline

After uploading, you’ll see your iris-classifier-pipeline in the UI. Click “Create run” to start an execution. You can provide experiment details and then launch the run.

The Kubeflow UI will display the pipeline’s execution graph in real-time. You can click on each component to view its logs, inputs, outputs, and any generated artifacts or metrics, like the accuracy we logged in the evaluation step.
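Runs can also be started from Python instead of the UI. The following is a minimal sketch using kfp.Client, assuming the Kubeflow Pipelines API is reachable at the host you pass in without extra authentication (a multi-user Kubeflow install typically needs session credentials as well). The experiment name and the make_run_name helper are hypothetical.

```python
from datetime import datetime, timezone


def make_run_name(pipeline_name: str, now: datetime) -> str:
    """Build a readable, timestamped run name."""
    return f"{pipeline_name}-{now.strftime('%Y%m%d-%H%M%S')}"


def submit_run(host: str, package_path: str = "iris_pipeline.yaml"):
    """Submit the compiled pipeline and wait for the run to finish."""
    import kfp  # imported lazily so make_run_name works without kfp installed

    client = kfp.Client(host=host)
    run = client.create_run_from_pipeline_package(
        pipeline_file=package_path,
        arguments={},  # iris_pipeline takes no parameters
        run_name=make_run_name("iris-classifier-pipeline", datetime.now(timezone.utc)),
        experiment_name="iris-experiments",  # hypothetical experiment name
    )
    # Block until the run completes or the timeout (in seconds) expires.
    return client.wait_for_run_completion(run.run_id, timeout=1800)
```

The run started this way appears in the UI like any other, so you can still inspect its graph, logs, and metrics there.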

Next Steps: CI/CD for ML and Model Serving

You’ve successfully built and run a reproducible machine learning pipeline! This is the foundation of AI automation. The next logical steps are:

1. Automating with CI/CD

To achieve true CI/CD for ML, you can automate the process of testing, compiling, and uploading your pipeline using tools like GitHub Actions or Jenkins. A typical workflow would trigger on a git push to the main branch, run tests on your components, compile the pipeline, and automatically upload the new version to Kubeflow.

Here’s a conceptual example of a GitHub Actions workflow step:

# .github/workflows/main.yml
# (Conceptual snippet)
- name: Compile and Upload Kubeflow Pipeline
  run: |
    pip install kfp
    python ./pipelines/pipeline.py # Compiles to iris_pipeline.yaml
    # Use the kfp CLI or a custom script based on kfp.Client to upload iris_pipeline.yaml
    # This step would require authenticating with your Kubeflow instance
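The upload step could be a small Python script called from the workflow. Here is one possible sketch, assuming the pipeline was already uploaded once under the name iris-classifier-pipeline and that the CI runner can reach the Kubeflow Pipelines API. Authentication is left out because it depends on your installation, and the version_name helper is our own naming convention.

```python
def version_name(pipeline_name: str, git_sha: str) -> str:
    """Name each pipeline version after the short commit hash that produced it."""
    return f"{pipeline_name}-{git_sha[:7]}"


def upload_version(
    host: str,
    git_sha: str,
    package_path: str = "iris_pipeline.yaml",
    pipeline_name: str = "iris-classifier-pipeline",
):
    """Upload the compiled YAML as a new version of an existing pipeline."""
    import kfp  # imported lazily so version_name works without kfp installed

    client = kfp.Client(host=host)
    return client.upload_pipeline_version(
        pipeline_package_path=package_path,
        pipeline_version_name=version_name(pipeline_name, git_sha),
        pipeline_name=pipeline_name,
    )
```

In GitHub Actions, the commit hash is available in the GITHUB_SHA environment variable. The host would come from a secret or variable you define yourself. Tying version names to commits makes it easy to trace any run back to the code that produced it.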

2. Serving the Model with KServe

Once a model is trained and evaluated, you need to deploy it for inference. Kubeflow integrates seamlessly with KServe (formerly KFServing) for this purpose. KServe provides a simple, serverless way to deploy models on Kubernetes, with advanced features like autoscaling (including scale-to-zero), canary deployments, and model explainability.

Deploying a model is as simple as applying a YAML manifest:

# kserve-deployment.yaml
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "iris-classifier"
spec:
  predictor:
    sklearn:
      storageUri: "pvc://<your-pvc-name>/<path-to-your-model.joblib>" # Path where the trained model was saved

You would apply this with kubectl apply -f kserve-deployment.yaml. KServe handles creating a REST endpoint for your model.
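Once the InferenceService is ready, you can query it over HTTP. The sketch below builds a request in KServe’s V1 inference protocol format using only the standard library. The base URL is an assumption that depends on how your cluster exposes the service, and requests routed through an Istio ingress gateway usually also need the service’s hostname in a Host header. The helper names are ours.

```python
import json
import urllib.request


def build_predict_request(base_url: str, model_name: str, rows: list) -> urllib.request.Request:
    """Build a V1-protocol predict request: POST /v1/models/<name>:predict."""
    url = f"{base_url.rstrip('/')}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": rows}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def predict(base_url: str, model_name: str, rows: list, timeout: float = 10.0) -> list:
    """Send the request and return the list under the 'predictions' key."""
    request = build_predict_request(base_url, model_name, rows)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["predictions"]
```

Each row holds the four Iris measurements in the same column order used for training, for example predict(base_url, "iris-classifier", [[5.1, 3.5, 1.4, 0.2]]).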

Conclusion

In this tutorial, we moved beyond theory and built a tangible, scalable MLOps pipeline using Kubeflow on Kubernetes. We defined containerized components, assembled them into a reproducible workflow, and executed it in a cloud-native environment. We also touched on the path forward: integrating with CI/CD tools for full automation and deploying models for real-world use with KServe.

Kubeflow provides a robust, open-source framework for standardizing your MLOps practices. By leveraging the power of Kubernetes, it ensures that your machine learning workflows are scalable, portable, and ready for the demands of production.

Now it’s your turn. Try extending this pipeline with more complex models, integrating a feature store, or setting up a full CI/CD loop. Share your experiences and questions in the comments below!