Serving LLMs at Scale with KitOps, Kubeflow, and KServe

Introduction

Over the past few years, large language models (LLMs) have transformed how we build intelligent applications. From chatbots to code assistants, these models are used to power production systems across industries. But while training LLMs has become more accessible, deploying them at scale remains a challenge. Models generally come with gigabyte-sized weight files, depend on specific library versions, require careful GPU or CPU resource allocation, and need constant versioning as new checkpoints roll out. More often than not, a model that works in a data scientist’s notebook can fail in production because of a mismatched dependency, a missing tokenizer file, or an environment variable that wasn’t set.

KitOps (a CNCF pro…

Introduction

KitOps (a CNCF project backed by Jozu) offers a solution called ModelKits, which is a standardized artifact that packages an ML model with its dependencies and configuration. This open-source toolkit lets organizations, developers, and data scientists bundle their models into versionable, signable, and portable ModelKits that can be pushed to any OCI-compliant registry. The result is consistent version tracking and reliable model artifacts across all environments, bringing the same level of control we expect from software development to machine learning deployments.

In this guide, we’ll show you how to combine KitOps with Kubeflow and KServe to serve large language models at scale. You’ll learn how to package an LLM into a ModelKit, deploy it with KServe’s inference endpoints, and let Jozu handle the orchestration, all without needing dedicated GPU hardware to follow along—you can take an even deeper dive into production ML on Kubernetes by downloading our full technical guide to Kubernetes ML.

Learning Objectives

Build and package a TensorFlow LLM model into a ModelKit using KitOps
Pack and push the ModelKit to Jozu, an OCI-compliant registry built for ModelKits
Set up Kubeflow and KServe to serve your model in production
Scale and secure your model deployments in production environments

Prerequisites and Setup

Before we start deploying LLMs at scale, let’s make sure you have the right tools installed and configured. This section walks through everything you need such as Python for running your model code, the KitOps CLI for packaging ModelKits, and a Jozu sandbox account for storing and managing your artifacts.

Install Python

For this project, you’ll need Python 3.10 or above installed on your system. This ensures compatibility with modern ML libraries like TensorFlow and the dependencies we’ll use throughout this guide. If you don’t have Python installed yet, grab it from python.org and follow the installation steps for your operating system.

Install the KitOps CLI

The Kit CLI is what we’ll use to pack, push, and manage ModelKits. Head over to the KitOps installation page and pick the installation method that matches your OS, whether you’re on macOS, Linux, or Windows, and install accordingly.

Once you’ve installed the CLI, verify it’s working by running:

kit version

The output should show the version details:

Sign Up for Jozu

Jozu is your OCI-compliant registry for ModelKits. It’s where you’ll push packaged models and pull them during deployment. To get started with Jozu, head over to jozu.ml and click Sign Up to create an account. Make sure to note your username and password as you’ll need them in the next step to authenticate your CLI.

Authenticate with Jozu

Now let’s connect your local Kit CLI to your Jozu account. Open a terminal and run:

kit login jozu.ml

You’ll be prompted to enter your username (the email you registered with) and the password you created. If everything is set up correctly, you’ll see:

Building a TensorFlow LLM Model

TensorFlow is one of the most popular open-source frameworks for building and training machine learning models. It was developed by Google, and it’s particularly well-suited for production environments where you need scalable, efficient model serving across CPUs, GPUs, and TPUs.

TensorFlow shines in enterprise deployments, mobile applications, and in scenarios where you need tight integration with serving infrastructure. In this guide, we’ll use TensorFlow to fine-tune a small T5 model that translates corporate jargon into plain language.

Set Up Your Project Directory

Let’s start by creating a clean workspace for our model. Run these commands in your terminal to create your project directory:

mkdir corporate-speak
cd corporate-speak

Now create a Python virtual environment to keep dependencies isolated. It is essential to use a virtual environment as it isolates the project’s dependencies from your global Python installation, therefore preventing conflicts with other projects and ensuring reproducible results:

python3 -m venv env
source env/bin/activate

Install Dependencies

Create a requirements.txt file in your project root with the following libraries:

tensorflow==2.19.1
transformers==4.49.0
huggingface-hub==0.26.0
tf-keras
fastapi
uvicorn
sentencepiece

Install everything with:

pip install -r requirements.txt

This pulls in TensorFlow for training, Transformers for the T5 model, FastAPI for serving later, and all the supporting libraries we’ll need.

Create the Training Data

Before we can train our model, we need some data. Create a data directory in your project root:

mkdir data

Inside the data directory, create a file called corporate\_speak.json and paste this training dataset:

[
{
"term": "Circle back",
"meaning": "We'll talk about this later because we don't want to deal with it right now."
},
{
"term": "Synergy",
"meaning": "Making two teams do one team's job, but with extra meetings."
},
{
"term": "Bandwidth",
"meaning": "How much energy or patience a person has left."
},
{
"term": "Low-hanging fruit",
"meaning": "The easiest task that still lets us look productive."
},
{
"term": "Touch base",
"meaning": "Talk briefly to pretend progress is being made."
},
{
"term": "Pivot",
"meaning": "Our original idea failed; let's rename it and try again."
},
{ "term": "Going forward", "meaning": "Forget what we said last time." },
{ "term": "Alignment", "meaning": "Make sure no one disagrees publicly." }
]

This small dataset gives the model eight examples of corporate jargon and their plain-language meanings. It’s just enough to fine-tune T5 for our demonstration without requiring heavy compute resources.

Create the Training Script

Next, make a directory for your application code:

mkdir app

Inside the app directory, create a file called train\_llm.py and add this code:

import os
import json
import tensorflow as tf
from transformers import T5Tokenizer, TFT5ForConditionalGeneration

BASE\_DIR \= os.path.dirname(os.path.dirname(os.path.abspath(\_\_file\_\_)))
DATA\_PATH \= os.path.join(BASE\_DIR, "data", "corporate\_speak.json")

print(f"Base Directory: {BASE\_DIR}")
print(f"Data Path: {DATA\_PATH}")

def load\_data(file\_path):
"""Loads JSON data from the specified file path."""
try:
with open(file\_path, 'r') as f:
data \= json.load(f)
print(f"Successfully loaded {len(data)} records from data file.")
return data
except FileNotFoundError:
print(f"ERROR: Data file not found at {file\_path}")
print("Please ensure you have created the file 'corporate\_speak.json' and the 'data' folder.")
return None
except json.JSONDecodeError:
print(f"ERROR: Could not decode JSON from {file\_path}. Check file format.")
return None

DATA \= load\_data(DATA\_PATH)
if DATA is None:
exit() ## Stop if data loading failed

prompts \= [f"term: {item['term']}" for item in DATA]
responses \= [f"meaning: {item['meaning']}" for item in DATA]

MODEL\_NAME \= 't5-small'
MAX\_LENGTH \= 128
BATCH\_SIZE \= 4
LEARNING\_RATE \= 1e-5
EPOCHS \= 15

print(f"\\nLoading T5 model and tokenizer: {MODEL\_NAME}...")
tokenizer \= T5Tokenizer.from\_pretrained(MODEL\_NAME)
model \= TFT5ForConditionalGeneration.from\_pretrained(MODEL\_NAME)

tokenized\_inputs \= tokenizer(
prompts,
return\_tensors='tf',
max\_length=MAX\_LENGTH,
padding='max\_length',
truncation=True
)

tokenized\_targets \= tokenizer(
responses,
return\_tensors='tf',
max\_length=MAX\_LENGTH,
padding='max\_length',
truncation=True
)

labels \= tokenized\_targets['input\_ids']

dataset \= tf.data.Dataset.from\_tensor\_slices(
(
{'input\_ids': tokenized\_inputs['input\_ids'],
'attention\_mask': tokenized\_inputs['attention\_mask']},
labels
)
).shuffle(buffer\_size=len(DATA)).batch(BATCH\_SIZE)

print("\\n--- Starting Fine-Tuning ---")

optimizer \= tf.keras.optimizers.Adam(learning\_rate=LEARNING\_RATE)

model.compile(optimizer=optimizer)

history \= model.fit(
dataset,
epochs=EPOCHS,
verbose=1
)

print("--- Fine-Tuning Complete ---")

print("\\n--- Testing Model Generation ---")

test\_term\_1 \= "term: Touch base"
test\_input\_1 \= tokenizer(test\_term\_1, return\_tensors='tf').input\_ids

output\_tokens\_1 \= model.generate(test\_input\_1, max\_length=MAX\_LENGTH)
decoded\_meaning\_1 \= tokenizer.decode(output\_tokens\_1[0], skip\_special\_tokens=True)

print(f"Input: '{test\_term\_1}'")
print(f"Output: '{decoded\_meaning\_1}'")

test\_term\_2 \= "term: Alignment"
test\_input\_2 \= tokenizer(test\_term\_2, return\_tensors='tf').input\_ids
output\_tokens\_2 \= model.generate(test\_input\_2, max\_length=MAX\_LENGTH)
decoded\_meaning\_2 \= tokenizer.decode(output\_tokens\_2[0], skip\_special\_tokens=True)

print(f"\\nInput: '{test\_term\_2}'")
print(f"Output: '{decoded\_meaning\_2}'")

MODEL\_SAVE\_PATH \= os.path.join(BASE\_DIR, "1")
os.makedirs(MODEL\_SAVE\_PATH, exist\_ok=True)

model.save(MODEL\_SAVE\_PATH, save\_format='tf')
tokenizer.save\_pretrained(MODEL\_SAVE\_PATH)
print(f"\\nModel saved to: {MODEL\_SAVE\_PATH}")

This script does four things: it loads your training data from a JSON file, tokenizes the inputs and targets for T5, fine-tunes the model for 15 epochs, and saves the trained weights along with the tokenizer to a directory called 1 in your project root.

It is important to save your model in a numbered directory or version number, as the Tensorflow Kserve program, expects to find your model in this format. Anything that deviates from this will prevent your Kserve inference service from working.

Train the Model

To train your model, run the following command from the root directory:

python3 app/train\_llm.py

The training process will kick off, and you’ll see output showing the model loading, training progress across epochs, test predictions, and finally confirmation that the model has been saved. When complete, you’ll have a new directory called 1 containing your model’s saved weights (saved_model.pb), variables, tokenizer config files, and all the assets TensorFlow needs to reload and serve your model later.

Testing the Model with FastAPI

Before we package our model for production, let’s make sure it actually works. We’ll build a simple FastAPI inference server that loads the trained model and exposes an endpoint for predictions.

Create the Inference Server

In your app directory, create a file called inference.py and add this code:

import os
import tensorflow as tf
from transformers import T5Tokenizer, TFT5ForConditionalGeneration
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import uvicorn

app \= FastAPI(
title="Jargon Decoder LLM API",
description="A service to translate corporate jargon using a fine-tuned T5 model.",
version="1.0.0"
)

tokenizer \= None
model \= None
MAX\_LENGTH \= 128

BASE\_DIR \= os.path.dirname(os.path.dirname(os.path.abspath(\_\_file\_\_)))
MODEL\_SAVE\_PATH \= os.path.join(BASE\_DIR, "1")

@app.on\_event("startup")
async def load\_model\_on\_startup():
"""Loads the fine-tuned T5 model and tokenizer when the FastAPI application starts."""
global tokenizer, model

print(f"Base Directory: {BASE\_DIR}")
print(f"Attempting to load model from: {MODEL\_SAVE\_PATH}")

try:
tokenizer \= T5Tokenizer.from\_pretrained(MODEL\_SAVE\_PATH)
model \= TFT5ForConditionalGeneration.from\_pretrained(MODEL\_SAVE\_PATH)
print("Model and tokenizer loaded successfully\! 🚀")
except Exception as e:
print(f"FATAL ERROR: Could not load model from {MODEL\_SAVE\_PATH}.")
print(f"Details: {e}")

class JargonRequest(BaseModel):
"""Schema for the input request."""
term: str \= "Circle back"

class JargonResponse(BaseModel):
"""Schema for the output response."""
original\_term: str
decoded\_meaning: str

def decode\_jargon(term: str, tokenizer, model) -> str:
"""
Core function to run inference on the loaded LLM.
"""
if not tokenizer or not model:
raise HTTPException(status\_code=503, detail="Model is not loaded or ready.")

prompt \= f"term: {term}"


input\_ids \= tokenizer(
prompt,
return\_tensors='tf',
max\_length=MAX\_LENGTH,
padding='max\_length',
truncation=True
).input\_ids


output\_tokens \= model.generate(
input\_ids,
max\_length=MAX\_LENGTH
)


decoded\_meaning \= tokenizer.decode(output\_tokens[0], skip\_special\_tokens=True)


if decoded\_meaning.startswith("meaning: "):
return decoded\_meaning[9:].strip()

return decoded\_meaning.strip()

@app.post("/decode/", response\_model=JargonResponse)
async def decode(request: JargonRequest):
"""
API endpoint to translate a corporate jargon term into plain meaning.
"""
try:
meaning \= decode\_jargon(request.term, tokenizer, model)
return JargonResponse(
original\_term=request.term,
decoded\_meaning=meaning
)
except HTTPException as e:
## Re-raise explicit HTTP exceptions
raise e
except Exception as e:
## Handle unexpected errors
print(f"Inference Error: {e}")
raise HTTPException(status\_code=500, detail=f"Internal server error during inference: {e}")

if \_\_name\_\_ \== "\_\_main\_\_":
uvicorn.run("inference:app", host="0.0.0.0", port=8000, reload=True)

This inference script sets up a FastAPI application that loads your fine-tuned T5 model on startup. The load_model_on_startup function pulls the tokenizer and model from the saved directory, making them available globally. The decode_jargon function handles the actual inference: it takes a corporate term, formats it as a prompt, runs it through the model, and returns the decoded meaning.

The /decode/ endpoint accepts POST requests with a jargon term and responds with the plain-language translation. Pydantic models ensure type safety for requests and responses, while error handling catches issues like missing models or inference failures.

Start the Server

Run the inference server from your project root:

python3 app/inference.py

You’ll see output showing the model loading and a confirmation that the FastAPI server is running on http://0.0.0.0:8000. The startup event will trigger immediately, pulling your trained weights into memory so they’re ready for inference requests.

Test the Endpoint

To test the endpoint, open a new terminal and send a test request with curl:

curl -X POST "http://localhost:8000/decode/" \\
-H "Content-Type: application/json" \\
-d '{"term": "Synergy"}'

If everything is working, you should see a JSON response with the decoded meaning:

{
"original\_term": "Synergy",
"decoded\_meaning": "Synergy"
}

The code and model is working and producing an output which is what we expect. Now that we’ve confirmed everything works locally, we can package the entire application code, model, and dependencies into a ModelKit for production deployment.

Packaging with KitOps

To make the workflow repeatable and production ready we’ll use KitOps to bundle our trained model, inference code, and training data into a single ModelKit.

Initialize the Kitfile

From your project root directory, run:

kit init .

This creates a Kitfile in your current directory. A Kitfile is a YAML manifest that describes everything needed to reproduce your ML project—model weights, code paths, datasets, and metadata. Think of it like a Dockerfile, but designed specifically for machine learning artifacts. It tells KitOps what to bundle into your ModelKit and how those pieces fit together.

Edit the Kitfile

The generated Kitfile is a good starting point, but it doesn’t capture the full structure of our project. Open the Kitfile and replace its contents with this:

manifestVersion: 1.2.0

package:
name: corporate-speak-model
description: A lightweight language model fine-tuned on corporate jargon to explain complex corporate terms in simple English.
authors: [Thoren Oakenshield]

code:
- path: .
description: All necessary scripts, configurations, and application logic

model:
name: T5
path: ./1/
framework: Tensorflow
version: 1.2.0
description: A lightweight language model fine-tuned on corporate jargon to explain complex corporate terms in simple English.

datasets:
- name: corporate-jargon-data
path: ./data/
description: A small JSON dataset containing corporate terms and their real-world meanings.

Let’s break down what this Kitfile does. The package section holds metadata which are the model name, a description, and the author. Next, the code section points to your entire project directory, capturing all your scripts, configuration files, and application logic.

Then, the model section specifies where your trained T5 weights live (the ./1/ directory we created during training), what framework they use, and the version. Finally, the datasets section references your training data in ./data/, so anyone pulling this ModelKit knows exactly what data was used to train the model. This single file gives you a complete snapshot of your ML project.

Pack the ModelKit

Now let’s bundle everything into a ModelKit, similar to how you build a Docker image. To pack your ModelKit run:

kit pack . -t jozu.ml/<username>/<model-kit-name>:<version>

Replace with your Jozu username and : with your model kit name and version. This command reads your Kitfile, collects all the referenced files (code, model weights, data), and packages them into a single OCI-compliant artifact. You’ll see output showing KitOps compressing and layering your files.

Push to Jozu

Once the pack completes, push your ModelKit to Jozu by running:

kit push jozu.ml/<username>/<model-kit-name>:<version>

The CLI uploads your ModelKit layers to the registry. When it finishes, head to your Jozu account at jozu.ml, click on My Repositories, and you should see your newly pushed package listed.

Setting Up the Serving Infrastructure

Before we can deploy our model with KServe, we need to set up the complete infrastructure stack. This includes Docker for containerization, Kubernetes for orchestration, Kubeflow for ML workflows, and KServe for model serving. Let’s walk through each installation step by step.

Install Docker

Docker is the container runtime that Minikube will use. If you’re on Linux, run:

sudo apt-get update && sudo apt-get install docker.io -y
sudo groupadd docker
sudo usermod -aG docker $USER
newgrp docker

For macOS or Windows users, head to the official Docker website and follow the installation instructions for your operating system.

Install kubectl

kubectl is the command-line tool for interacting with Kubernetes clusters. It lets you deploy applications, inspect resources, and manage cluster operations.

To Install it run:

sudo snap install kubectl --classic
kubectl version --client  ## Verify installation

Install Minikube

Next is Minikube. Minikube runs a local Kubernetes cluster on your machine which is perfect for development and testing without needing cloud resources. TO download and install it, run:

curl -LO https://github.com/kubernetes/minikube/releases/latest/download/minikube-linux-amd64
sudo install minikube-linux-amd64 /usr/local/bin/minikube && rm minikube-linux-amd64
minikube version

Start Minikube

It’s important to start your local Kubernetes cluster with enough resources to handle model serving else your cluster will fail in the process of serving your model. To start minikube run:

minikube start --cpus=4 --memory=10240 --driver=docker
kubectl get nodes
kubectl cluster-info

This spins up a single-node cluster with 4 CPUs and 10GB of memory. The kubectl get nodes command confirms your cluster is running, and kubectl cluster-info shows the control plane endpoint.

Install Kubeflow Pipelines

Kubeflow is an open-source platform for running ML workflows on Kubernetes. It provides tools for orchestrating complex pipelines, tracking experiments, and managing model training. We’ll install Kubeflow Pipelines, which handles the deployment and serving orchestration:

export PIPELINE\_VERSION=2.4.0
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=$PIPELINE\_VERSION"
kubectl wait --for condition=established --timeout=60s crd/applications.app.k8s.io
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic?ref=$PIPELINE\_VERSION"

This installation can take a few minutes. To check if all components are ready, run:

kubectl get pods -n kubeflow

Wait until all pods show Running status. You should see output similar to this:

NAME                                               READY   STATUS    RESTARTS      AGE
cache-deployer-deployment-85b76bcb6-fmslx          1/1     Running   0             21h
cache-server-66bd9b7875-rxdvl                      1/1     Running   0             21h
metadata-envoy-deployment-746744dfb8-zdgtx         1/1     Running   0             21h
metadata-grpc-deployment-54654fc5bb-9cvdg          1/1     Running   6 (21h ago)   21h
metadata-writer-68658fdf4b-7zpbn                   1/1     Running   1 (20h ago)   21h
minio-85cd46c575-gt7kp                             1/1     Running   0             21h
ml-pipeline-6978d6f776-p4zt9                       1/1     Running   3 (20h ago)   21h
ml-pipeline-persistenceagent-7d4c675666-28qnz      1/1     Running   1 (20h ago)   21h
ml-pipeline-scheduledworkflow-695b7b8988-swzdj     1/1     Running   0             21h
ml-pipeline-ui-88467988b-4c6md                     1/1     Running   0             21h
ml-pipeline-viewer-crd-bf5dc64dd-5xqv9             1/1     Running   0             21h
ml-pipeline-visualizationserver-5584ff64d7-jr686   1/1     Running   0             21h
mysql-6745b5984c-dn4r6                             1/1     Running   0             21h
workflow-controller-5b84568b94-tjjcz               1/1     Running   0             21h

Install KServe

KServe is a Kubernetes-native platform for serving ML models. It handles autoscaling, canary rollouts, and provides a unified inference protocol across different model frameworks. You can install it with:

curl -s "https://raw.githubusercontent.com/kserve/kserve/release-0.14/hack/quick\_install.sh" | bash

Once the installation completes, verify that KServe and its dependencies are running with the following commands:

kubectl get pods -n kserve
kubectl get pods -n istio-system
kubectl get pods -n knative-serving

You should see output confirming all components are operational:

NAME                                        READY   STATUS    RESTARTS   AGE
kserve-controller-manager-86869697f-mcgrd   2/2     Running   0          20h

NAME                                    READY   STATUS    RESTARTS   AGE
istio-ingressgateway-698fff54fb-bbqh7   1/1     Running   0          20h
istiod-7fdcb55c9c-qtwf5                 1/1     Running   0          20h

NAME                                    READY   STATUS    RESTARTS   AGE
activator-5967d4d645-fgfhw              1/1     Running   0          20h
autoscaler-598c65f5bc-9pdt4             1/1     Running   0          20h
autoscaler-hpa-5b45c655dc-hx4qd         1/1     Running   0          20h
controller-7cf55b567b-x45bn             1/1     Running   0          20h
knative-operator-76b6894f45-58xlt       1/1     Running   0          20h
net-istio-controller-54b458f57b-7cqj7   1/1     Running   0          20h
net-istio-webhook-7bc64cfff6-mslz9      1/1     Running   0          20h
operator-webhook-565c994ff9-f7hzq       1/1     Running   0          20h
webhook-7f575896d6-gc4qc                1/1     Running   0          20h

Create Registry Credentials

KServe needs credentials to pull your ModelKit from Jozu. To set up these credentials in your project directory, create a file called kitops-jozu-secret.yaml and add the following:

apiVersion: v1
kind: Secret
metadata:
name: jozu-registry-secret
type: Opaque
data:
KIT\_USER: <YOUR USERNAME ENCODED IN BASE 64>
KIT\_PASSWORD: <YOUR PASSWORD ENCODED IN BASE 64>

Replace the base64-encoded values with your own Jozu credentials. You can encode your username and password by running:

echo -n "your-username" | base64
echo -n "your-password" | base64

Serving the Model with KServe

Now that our infrastructure is ready and our ModelKit is in the registry, let’s deploy it with KServe. This section walks through configuring KServe to pull ModelKits, defining the inference service, and making predictions against the deployed endpoint.

Configure the Storage Initializer

KServe uses storage initializers to fetch model artifacts from registries before starting the inference container. We need to tell KServe how to pull ModelKits using the KitOps storage initializer. To do this create a file called kitops-storage-initializer.yaml:

apiVersion: serving.kserve.io/v1alpha1
kind: ClusterStorageContainer
metadata:
name: kitops
spec:
container:
name: storage-initializer
image: ghcr.io/kitops-ml/kitops-kserve:latest
imagePullPolicy: Always
env:
- name: KIT\_UNPACK\_FLAGS
value: ""
- name: KIT\_USER
valueFrom:
secretKeyRef:
name: jozu-registry-secret
key: KIT\_USER
optional: true
- name: KIT\_PASSWORD
valueFrom:
secretKeyRef:
name: jozu-registry-secret
key: KIT\_PASSWORD
optional: true
resources:
requests:
memory: 100Mi
cpu: 100m
limits:
memory: 1Gi
supportedUriFormats:
- prefix: kit://

This ClusterStorageContainer defines a custom storage initializer that understands kit:// URIs. When KServe sees a storageUri starting with kit://, it uses this initializer to authenticate with Jozu (via the credentials in kit-secret), pull the ModelKit, unpack it, and mount the model artifacts into the inference container. The resource limits ensure the initializer doesn’t consume too much memory during the download and unpacking phase.

Create the InferenceService

An InferenceService is KServe’s core resource for deploying models. It handles routing, autoscaling, canary deployments, and connects your model to a scalable serving runtime. Create a file called kitops-kserve-inference.yaml:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: corporate-speak-model-tensorflow
spec:
predictor:
model:
modelFormat:
name: tensorflow
resources:
requests:
cpu: "250m"
memory: "1Gi"
limits:
cpu: "500m"
memory: "2Gi"
storageUri: kit://jozu.ml/<username>/<model-kit-name>:<version>

Replace the storageUri with your actual ModelKit reference from Jozu (username, repository name, and tag). The modelFormat: tensorflow tells KServe to use the TensorFlow serving runtime, while the resource requests and limits ensure your model has enough CPU and memory to handle inference without monopolizing cluster resources.

Deploy the Service

Apply all three manifests to your cluster:

kubectl apply -f kitops-jozu-secret.yaml
kubectl apply -f kitops-storage-initializer.yaml
kubectl apply -f kitops-kserve-inference.yaml

If successful, you’ll see:

secret/jozu-registry-secret
clusterstoragecontainer.serving.kserve.io/kitops created
inferenceservice.serving.kserve.io/corporate-speak-model-tensorflow created

The deployment takes a few minutes as KServe pulls the ModelKit, unpacks it, and starts the inference pod. You can monitor the progress with:

kubectl get pods

Wait until you see your predictor pod running:

NAME                                                              READY   STATUS    RESTARTS   AGE
corporate-speak-model-tensorflow-predictor-00001-deploymenwcc2n   2/2     Running   0          2m

Access the Inference Endpoint

Once the pod is running, find the service endpoint. You can do this by running:

kubectl get services | grep corporate-speak-model-tensorflow

You’ll see several services created by KServe:

corporate-speak-model-tensorflow                           ExternalName   <none>           knative-local-gateway.istio-system.svc.cluster.local   <none>                                               20h
corporate-speak-model-tensorflow-predictor                 ExternalName   <none>           knative-local-gateway.istio-system.svc.cluster.local   80/TCP                                               20h
corporate-speak-model-tensorflow-predictor-00001           ClusterIP      10.103.234.235   <none>                                                 80/TCP,443/TCP                                       20h
corporate-speak-model-tensorflow-predictor-00001-private   ClusterIP      10.104.180.43    <none>                                                 80/TCP,443/TCP,9090/TCP,9091/TCP,8022/TCP,8012/TCP   20h

For local testing, forward the private service to your machine:

kubectl port-forward service/corporate-speak-model-tensorflow-predictor-00001-private 8080:80

You should see:

Forwarding from 127.0.0.1:8080 -> 8012
Forwarding from [::1]:8080 -> 8012

Now you can test your inference service.

Testing the Deployment with Tokenized Input

Before testing it is important to know that, KServe’s standard TensorFlow serving runtime expects numerical tensors that correspond to the model’s signature. Since our T5 model was fine-tuned using token IDs, we must tokenize the input locally before sending the request.

First, you’ll need a quick script to generate the correct numerical payload. To do this, create a temporary Python script generate\_payload.py in your project root to handle the tokenization and generate the JSON payload:


import tensorflow as tf ## Required for Tensors
from transformers import T5Tokenizer
import json
import os

MODEL\_SAVE\_PATH \= os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(\_\_file\_\_))), "1")
tokenizer \= T5Tokenizer.from\_pretrained(MODEL\_SAVE\_PATH)
MAX\_LENGTH \= 128
term \= "Synergy" ## You can change the term here
prompt \= f"term: {term}" ## T5 was trained to expect this prefix

inputs \= tokenizer(
prompt,
return\_tensors='tf',
max\_length=MAX\_LENGTH,
padding='max\_length',
truncation=True
)

input\_ids\_list \= inputs['input\_ids'][0].numpy().tolist()
attention\_mask\_list \= inputs['attention\_mask'][0].numpy().tolist()

payload \= {
"instances": [
{
"input\_ids": input\_ids\_list,
"attention\_mask": attention\_mask\_list ## KServe needs both for attention
}
]
}

with open('test\_payload.json', 'w') as f:
json.dump(payload, f, indent=2)

In a new terminal, run the script to create the file:

python3 generate\_payload.py

Now, use curl to send the generated test_payload.json file to the KServe endpoint.

curl -X POST http://localhost:8080/v1/models/corporate-speak-model-tensorflow:predict \\
-H "Content-Type: application/json" \\
-d @test\_payload.json

KServe will route the request containing the numerical IDs to the TensorFlow serving runtime, which passes it directly to the T5 model’s generation function. You should see a JSON response with the decoded meaning:

{
"predictions": [
{
"output": "Synergy"
}
]
}

Scaling and Securing Your Deployment

Running a model in production requires thinking beyond basic functionality. As time goes on you will need autoscaling to handle traffic spikes, resource limits to prevent runaway costs, and security measures to protect your models and data. KServe and KitOps give you the tools to handle all of this without the need to build custom infrastructure.

Autoscaling with KServe

KServe integrates with Knative Serving to provide automatic scaling based on request load. By default, your InferenceService will scale down to zero replicas when idle and scale up as traffic increases. You can customize this behavior by adding autoscaling annotations to your InferenceService manifest.

To do this, edit your kitops-kserve-inference.yaml to include autoscaling configuration:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: corporate-speak-model-tensorflow
annotations:
autoscaling.knative.dev/target: "10"
autoscaling.knative.dev/minScale: "1"
autoscaling.knative.dev/maxScale: "5"
spec:
predictor:
model:
modelFormat:
name: tensorflow
resources:
requests:
cpu: "250m"
memory: "1Gi"
limits:
cpu: "500m"
memory: "2Gi"
storageUri: kit://jozu.ml/<username>/<model-kit-name>:<version>

The target annotation sets the concurrency target per pod (10 requests), minScale ensures at least one pod is always running for faster response times, and maxScale caps the maximum number of replicas to 5, preventing runaway scaling costs. Knative will automatically add or remove pods based on incoming traffic patterns.

Resource Management

The resource limits in your InferenceService prevent a single model from consuming all cluster resources. The requests section tells Kubernetes how much CPU and memory to reserve, while limits sets the maximum the pod can use. For production deployments, you can tune these values based on your model’s actual memory footprint and inference latency requirements.

If you’re running multiple models, consider creating separate namespaces for isolation:

kubectl create namespace production-models
kubectl apply -f kitops-kserve-inference.yaml -n production-models

This keeps production models separate from staging or experimental deployments and makes it easier to apply different resource quotas and network policies per environment.

Securing ModelKits with Cosign

ModelKit signing ensures that the artifacts you deploy haven’t been tampered with between packaging and deployment. You can use Cosign to sign your ModelKits immediately after pushing them to Jozu:

cosign generate-key-pair
cosign sign jozu.ml/<username>/<model-kit-name>:<version> --key cosign.key

This creates a cryptographic signature attached to your ModelKit. In production, you can configure KServe to verify signatures before pulling models, rejecting any unsigned or tampered artifacts. The signature verification happens during the storage initialization phase, before the model ever loads into memory.

Model Versioning and Rollback

One of KitOps’ biggest advantages is version control for models. Every ModelKit you push to Jozu is immutable and tagged. If a new model version causes issues in production, rolling back is as simple as updating the storageUri in your InferenceService:

storageUri: kit://jozu.ml/<username>/<model-kit-name>:<the-previous-version>

Note: When a ModelKit is pushed to Jozu, it is automatically run through 5 different vulnerability scanning tools to ensure that your model is safe and secure. Jozu also creates a downloadable audit log, consisting of the model’s complete lineage.

Apply the change, and KServe will perform a blue-green deployment, spinning up new pods with the old model version while draining traffic from the problematic version. You can also use KServe’s canary deployment features to test new model versions with a percentage of traffic before fully rolling out:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: corporate-speak-model-tensorflow
spec:
predictor:
model:
modelFormat:
name: tensorflow
storageUri: kit://jozu.ml/<username>/<model-kit-name>:<a-new-version>
canaryTrafficPercent: 20

This routes 20% of traffic to the new model while keeping 80% on the stable version. Monitor metrics, and if everything looks good, increase the percentage until you’re confident enough to promote the canary to full production.

Wrapping Up

Having a good model isn’t enough to serve machine learning applications at scale. The combination of KitOps, Kubeflow, KServe, and Jozu brings software development best practices, like containerization, version control, and automated scaling, into the ML workflow. KitOps standardizes your LLM into a portable ModelKit for reproducible packaging and security, while KServe handles reliable, production-grade serving and automated scaling on Kubernetes, eliminating the need for custom engineering.

This guide demonstrated how to build a TensorFlow LLM, package it with KitOps, push it to an OCI registry, and deploy it using KServe on Kubernetes. The steps covered key operational patterns like configuring autoscaling, securing ModelKits with signatures, managing resource allocation across environments, and performing deployment rollbacks. This consistent methodology scales effortlessly from development environments like Minikube to high-volume production clusters like EKS, GKE, or on-premises systems.

To learn more about KitOps visit kitops.org. To try Jozu Hub in your private environment, you can contact the Jozu team to start a free two-week POC.

Introduction

Introduction

Learning Objectives

Prerequisites and Setup

Install Python

Install the KitOps CLI

Sign Up for Jozu

Authenticate with Jozu

Building a TensorFlow LLM Model

Set Up Your Project Directory

Install Dependencies

Create the Training Data

Create the Training Script

Train the Model

Testing the Model with FastAPI

Create the Inference Server

Start the Server

Test the Endpoint

Packaging with KitOps

Initialize the Kitfile

Edit the Kitfile

Pack the ModelKit

Push to Jozu

Setting Up the Serving Infrastructure

Install Docker

Install kubectl

Install Minikube

Start Minikube

Install Kubeflow Pipelines

Install KServe

Create Registry Credentials

Serving the Model with KServe

Configure the Storage Initializer

Create the InferenceService

Deploy the Service

Access the Inference Endpoint

Testing the Deployment with Tokenized Input

Scaling and Securing Your Deployment

Autoscaling with KServe

Resource Management

Securing ModelKits with Cosign

Model Versioning and Rollback

Wrapping Up

Similar Posts