Organizations that want to run large language models (LLMs) on their own infrastructure, whether in private data centers or in the cloud, often face significant challenges related to GPU availability, capacity, and cost.
For example, models like Qwen3-Coder-30B-A3B-Instruct offer strong code-generation capabilities, but the memory footprint of larger models makes them difficult to serve efficiently, even on modern GPUs. This particular model requires multiple NVIDIA L40S GPUs using tensor parallelism. The problem becomes even more complex when supporting long context windows (which are essential for coding assistants or other large-context tasks like retrieval-augmented generation, or RAG). In these cases, the key-value (KV) cache alone can consume gigabytes of GPU memory.
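To make that concrete, here is a rough, back-of-the-envelope estimate of how the KV cache grows with context length. This is a minimal sketch: the layer count, KV head count, and head dimension below are illustrative assumptions, not values taken from the model's published configuration.

# Back-of-the-envelope KV-cache sizing (illustrative values only;
# check the model's config.json for the real hyperparameters).
num_layers = 48        # assumed transformer layer count
num_kv_heads = 4       # assumed grouped-query attention KV heads
head_dim = 128         # assumed per-head dimension
bytes_per_value = 2    # FP16/BF16 cache entries

# Two tensors (K and V) per layer, per token.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

for context_len in (16_000, 32_000, 128_000):
    gib = kv_bytes_per_token * context_len / 1024**3
    print(f"{context_len:>7} tokens -> ~{gib:.1f} GiB of KV cache per sequence")

Even under these assumed values, a single long-context sequence consumes on the order of gigabytes of GPU memory, on top of the model weights themselves.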
To address these challenges, you can compress the model through quantization, which reduces its memory footprint by storing its numerical weights at lower precision. However, compression requires careful evaluation: you must verify that the quantized model remains viable, using benchmarking tools that specialize in code-specific tasks.
Once the model is validated, the next challenge is how to package and version it for reproducibility and reusability. You must then deploy the model to GPU-enabled infrastructure, such as Red Hat OpenShift AI, where it can be served efficiently using runtimes like vLLM.
In the pipeline for this article, we used the LLM Compressor from Red Hat AI Inference Server to quantize Qwen3-Coder-30B-A3B-Instruct with activation-aware weight quantization (AWQ), which redistributes weight scales to minimize quantization error. This approach enables single-GPU serving with strong accuracy retention.
We used lm_eval to measure the quantized model's accuracy on code-focused benchmarks, and GuideLLM to measure its runtime performance on a single GPU compared to an unquantized, multi-GPU baseline.
Figure 1 shows a summary of the quantization and benchmarking results.
Figure 1: Quantization drastically reduces the model's file size (from 63.97 GB to 16.69 GB) and significantly improves efficiency, resulting in lower latency (time to first token, or TTFT) under high load without a meaningful loss in accuracy.
We'll examine the results in detail later, but the overview shows that using the right compression and validation tools allows you to deploy LLMs efficiently on less infrastructure without sacrificing quality; in this case, both performance and accuracy actually improved.
Workflow
The workflow to quantize, evaluate, package, and deploy an LLM can be broken down into the following stages:
- Model download and conversion
- Quantization
- Validation and evaluation
- Packaging in ModelCar format
- Pushing to model registry
- Deployment on OpenShift AI with vLLM
- Performance benchmarking

The model-car-importer repository contains an example pipeline that performs these tasks.
Stage 1: Model download and conversion
The pipeline begins by fetching the files from the Qwen3-Coder-30B-A3B-Instruct repository on Hugging Face. This task pulls down the model weights and configuration files from the Hugging Face Hub into shared workspace storage.
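For illustration, the download step behaves roughly like the following hedged sketch built on the huggingface_hub library's snapshot_download function. The destination path is an assumption, and the pipeline's actual task may differ in the details.

# Hedged sketch of the download step; the pipeline's own task may differ.
import os
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Qwen/Qwen3-Coder-30B-A3B-Instruct",
    local_dir="/workspace/shared-workspace/model",   # assumed workspace path
    allow_patterns=["*.safetensors", "*.json", "*.txt", "*.md", "*.model"],
    token=os.environ.get("HF_TOKEN"),
)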
Stage 2: Quantization
This stage uses the LLM Compressor from Red Hat AI Inference Server to quantize the downloaded model. For this exercise, we use AWQ quantization. This approach compresses model weights to 4 bits in an activation-aware way, preserving numerical fidelity and inference stability better than naive quantization.
This approach is ideal for serving large models like Qwen3-Coder-30B-A3B-Instruct on constrained GPU infrastructure because it significantly reduces memory usage while maintaining accuracy. By using AWQ, enterprises can deploy advanced LLMs more efficiently on hardware such as NVIDIA L40S GPUs.
Stage 3: Evaluation and benchmarking
Compression through quantization requires verification to ensure performance does not degrade significantly. The pipeline integrates benchmark tooling, such as the language model evaluation harness (lm_eval), to validate the quantized model's accuracy on domain-specific tasks like code generation (for example, HumanEval).
In addition to running benchmarks, the pipeline also uses GuideLLM to assess the quantized modelâs performance and resource requirements.
The metrics from this stage can help determine if the quantized model is production-ready.
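As a point of reference, a HumanEval run with lm_eval against an OpenAI-compatible vLLM endpoint can be launched along these lines. This is a hedged example: the endpoint URL and served model name are placeholders, and task names and flags (for example, --confirm_run_unsafe_code) can differ between lm_eval releases.

# Hedged example: run HumanEval with lm_eval against an OpenAI-compatible
# vLLM endpoint. Flags and task names can differ between lm_eval releases.
export HF_ALLOW_CODE_EVAL=1   # HumanEval executes generated code locally
lm_eval \
  --model local-completions \
  --model_args model=qwen3-coder-30b-a3b-instruct-quantized,base_url=http://<vllm-host>:8000/v1/completions,num_concurrent=4 \
  --tasks humaneval \
  --confirm_run_unsafe_code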
Stage 4: Packaging with ModelCar
Once validated, the model is packaged using the ModelCar format for versioned, OCI-compatible LLM deployment. ModelCar images ensure reproducibility and versioned model releases.
Stage 5: Pushing to an OCI registry
Once the model is packaged in ModelCar format, the pipeline pushes the OCI image to an OCI registry like Quay.io.
Stage 6: Deployment to OpenShift AI (with vLLM)
The final deployment step involves configuring an OpenShift AI ServingRuntime that uses vLLM and deploying the model from the ModelCar OCI image. This allows the model to be served behind an OpenShift Route, with native GPU scheduling, autoscaling, and monitoring via Prometheus and Grafana.
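The pipeline handles this deployment for you, but for reference, a KServe InferenceService that serves a ModelCar image with vLLM looks roughly like the following sketch. The resource name, runtime reference, image tag, and GPU request are illustrative assumptions.

# Hedged sketch of a KServe InferenceService backed by a ModelCar image;
# names and the ServingRuntime reference are illustrative.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: qwen3-coder-30b-a3b-instruct-quantized
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      runtime: vllm-runtime            # assumed ServingRuntime name
      storageUri: oci://quay.io/your-org/your-modelcar-repo:v1.0.0
      resources:
        limits:
          nvidia.com/gpu: "1"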
Deploying the ModelCar pipeline on OpenShift
To get started with optimizing and deploying a large code model like Qwen3-Coder-30B-A3B-Instruct using the model-car-importer pipeline, you can use the following PipelineRun specification. This configuration handles the full lifecycle: downloading the model, quantizing it using AWQ, evaluating it on code-specific tasks, packaging it as a ModelCar, and deploying it to OpenShift AI with model registry integration.
Next, we'll walk through a quick summary of the steps.
Prerequisites
- OpenShift AI cluster with a GPU-enabled node (for example, an AWS EC2 g6e.12xlarge instance providing 4 NVIDIA L40S Tensor Core GPUs with 48 GB of memory each)
- Access to Quay.io (for pushing images)
- Access to Hugging Face (for downloading models)
- OpenShift AI model registry service
- OpenShift CLI (oc)
1. Set up your environment
Clone the code from https://github.com/rh-aiservices-bu/model-car-importer/tree/main.
Follow the steps in the README to install the pipeline, up to the creation of the PipelineRun.
2. Set required environment variables
Before creating the PipelineRun, define the required variables in your environment:
# Hugging Face
export HUGGINGFACE_MODEL="Qwen/Qwen3-Coder-30B-A3B-Instruct"
# Model details
export MODEL_NAME="Qwen3-Coder-30B-A3B-Instruct"
export MODEL_VERSION="v1.0.0"
export QUAY_REPOSITORY="quay.io/your-org/your-modelcar-repo"
export MODEL_REGISTRY_URL="your-openshift-ai-model-registry"
export HF_TOKEN="your-huggingface-token" # used via secret
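The Hugging Face token is consumed through a Kubernetes secret rather than passed directly to the pipeline. If the README's setup steps have not already created it, a command along these lines does the job; note that the secret name and key shown here are illustrative assumptions, so use the names the repository expects.

# Illustrative secret name and key; match what the repository's tasks reference.
oc create secret generic huggingface-secret \
  --from-literal=HF_TOKEN="${HF_TOKEN}"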
3. Create the compression script
The repository contains compress-code.py, which runs compression using a specialized coding dataset for calibration; in this case, the codeparrot/self-instruct-starcoder dataset.
The following recipe configures the AWQModifier, based on this example.
from llmcompressor.modifiers.awq import AWQModifier

# Quantize Linear layers to 4-bit weights with 16-bit activations (W4A16),
# leaving the MoE router gates and the lm_head in their original precision.
recipe = [
    AWQModifier(
        duo_scaling=False,
        ignore=[
            "lm_head",
            "re:.*mlp.gate$",
            "re:.*mlp.shared_expert_gate$",
        ],
        scheme="W4A16",
        targets=["Linear"],
    ),
]
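For context, compress-code.py applies a recipe like this through LLM Compressor's one-shot flow. The following is a hedged sketch rather than the actual script: import paths vary by LLM Compressor version, and the workspace path, calibration split, sequence length, and sample count are assumptions; the real script also performs more careful calibration-data preprocessing.

# Hedged sketch of the compression flow in compress-code.py.
from datasets import load_dataset
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

MODEL_DIR = "/workspace/shared-workspace/model"   # assumed workspace path

# Code-focused calibration data (split name is an assumption; check the dataset card).
calibration = load_dataset("codeparrot/self-instruct-starcoder", split="curated")

recipe = [
    AWQModifier(
        duo_scaling=False,
        ignore=["lm_head", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$"],
        scheme="W4A16",
        targets=["Linear"],
    ),
]

oneshot(
    model=MODEL_DIR,
    dataset=calibration,
    recipe=recipe,
    max_seq_length=2048,              # assumed calibration sequence length
    num_calibration_samples=256,      # assumed number of calibration samples
    output_dir=MODEL_DIR + "-w4a16",  # write the compressed checkpoint alongside the original
)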
Create (or update) the compress-script ConfigMap to use this script:
oc create configmap compress-script \
  --from-file=compress.py=tasks/compress/compress-code.py
4. Run the pipeline
Run the following command to create the PipelineRun:
cat <<EOF | oc create -f -
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: modelcar-pipelinerun
spec:
  pipelineRef:
    name: modelcar-pipeline
  timeout: 24h  # 24-hour timeout
  serviceAccountName: modelcar-pipeline
  params:
    - name: HUGGINGFACE_MODEL
      value: "${HUGGINGFACE_MODEL}"
    - name: OCI_IMAGE
      value: "${QUAY_REPOSITORY}"
    - name: HUGGINGFACE_ALLOW_PATTERNS
      value: "*.safetensors *.json *.txt *.md *.model"
    - name: COMPRESS_MODEL
      value: "true"
    - name: MODEL_NAME
      value: "${MODEL_NAME}"
    - name: MODEL_VERSION
      value: "${MODEL_VERSION}"
    - name: MODEL_REGISTRY_URL
      value: "${MODEL_REGISTRY_URL}"
    - name: DEPLOY_MODEL
      value: "true"
    - name: EVALUATE_MODEL
      value: "true"
    - name: GUIDELLM_EVALUATE_MODEL
      value: "true"
    - name: MAX_MODEL_LEN
      value: "16000"
    # - name: SKIP_TASKS
    #   value: "cleanup-workspace,pull-model-from-huggingface,compress-model,evaluate-model,build-and-push-modelcar,register-with-registry"
  workspaces:
    - name: shared-workspace
      persistentVolumeClaim:
        claimName: modelcar-storage
    - name: quay-auth-workspace
      secret:
        secretName: quay-auth
  podTemplate:
    securityContext:
      runAsUser: 1001
      fsGroup: 1001
    tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
    nodeSelector:
      nvidia.com/gpu.present: "true"
EOF
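Creating the PipelineRun kicks off a long-running job (hence the 24-hour timeout in the spec). You can follow its progress with the Tekton CLI or watch its status with oc:

# Stream logs for the run (requires the tkn CLI)
tkn pipelinerun logs modelcar-pipelinerun -f

# Or watch the run's status conditions
oc get pipelinerun modelcar-pipelinerun -w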
What this pipeline does:
- Downloads the model from Hugging Face.
- Applies AWQ quantization for efficient GPU serving.
- Evaluates the quantized model using a custom code-focused evaluation script, tailored for programming tasks (for example, HumanEval).
- Sets a large context window via MAX_MODEL_LEN=16000, optimizing the model for longer code completions.
- Packages and pushes the model as a ModelCar to an OCI registry (such as Quay).
- Registers the model in the OpenShift AI Model Registry.
- Deploys the model to OpenShift AI.
- Deploys AnythingLLM connected to the model.
- Performs performance benchmarking with GuideLLM.

Once the pipeline is complete, you should see a completed pipeline run, as shown in Figure 2.
Figure 2: A successful run of the pipeline.
Results
Here's an overview of the results from the model compression and testing.
File size reduction
Compression reduced the model weights from 64 GB to 16.7 GB.
Figure 3: Model file sizes: Quantized versus unquantized.
Model evaluation
For HumanEval (base tests), the quantized Qwen3-Coder-30B-A3B-Instruct achieved pass@1 = 0.933.
For comparison, the unquantized model achieved pass@1 ≈ 0.930 on the same benchmark.
Figure 4: HumanEval (base tests) for the unquantized and quantized models.
Running the same evaluations on the quantized model produced a 93.3% pass@1 on HumanEval, a slight increase in accuracy compared to the unquantized model.
Model performance
We used GuideLLM to run performance tests against the model deployed on vLLM.
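For reference, a GuideLLM sweep against the vLLM endpoint can be launched along these lines. This is a hedged example: the target URL and workload shape are placeholders, and flags may differ between GuideLLM versions.

# Hedged example of a GuideLLM sweep; adjust the target and workload shape.
guidellm benchmark \
  --target "http://<vllm-host>:8000" \
  --rate-type sweep \
  --max-seconds 60 \
  --data "prompt_tokens=256,output_tokens=128"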
The GuideLLM benchmarks highlight a clear efficiency advantage for the quantized model. Despite running on just one NVIDIA L40S GPU (versus four GPUs for the unquantized baseline), the quantized model achieves approximately 33 percent higher maximum throughput (around 8,056 versus 6,032 tokens per second) and sustains lower latencies across most constant-load tests. (See Figure 5.)
Figure 5: Max throughput: Quantized versus unquantized.
Time to first token (TTFT) is also consistently reduced, with the quantized model staying well below the multi-GPU unquantized setup. (See Figure 6.)
Figure 6: Requests per second versus TTFT.
Code assistant integration
Once the model is deployed, it can be used by coding assistants such as continue.dev, as shown in Figure 7. Configurations will vary, but as long as the coding assistant supports OpenAI API-compatible model endpoints, you should be able to point it at the model we've deployed to OpenShift AI.
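As a hedged example, a model entry in a Continue (continue.dev) configuration for this deployment might look like the following sketch. The exact schema depends on your Continue version, and the route, served model name, and API key are placeholders.

# Hedged sketch of a Continue model entry; field names may vary by version.
models:
  - name: Qwen3-Coder-30B (OpenShift AI)
    provider: openai
    model: qwen3-coder-30b-a3b-instruct-quantized    # served model name
    apiBase: https://<your-openshift-route>/v1       # vLLM's OpenAI-compatible endpoint
    apiKey: <token-if-authentication-is-enabled>
    roles:
      - chat
      - edit
      - autocomplete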
Figure 7: A code assistant using the model that was deployed to OpenShift AI.
Summary
In this post, we walked through the end-to-end process of optimizing and deploying a large code-generation model, Qwen3-Coder-30B-A3B-Instruct, for enterprise environments. We addressed the key challenges of serving LLMs (namely memory constraints, reproducibility, and deployment at scale) and showed how AWQ quantization enables significant compression without performance trade-offs.
We then explored how to automate the entire workflow using a pipeline on OpenShift AI: downloading and quantizing the model, evaluating its performance with a code evaluation harness, and packaging it in the ModelCar format for versioned delivery. With integration into an OCI registry and a model registry, and support for high-performance runtimes like vLLM, this approach turns complex LLM deployment into a repeatable, production-ready process.