TL;DR
inference-net/ClipTagger-12b is a Gemma-3-12B-based VLM released under the Apache-2.0 license. On a single GPU it generates structured JSON annotations for video frames or images, at substantially lower cost than closed SOTA models such as Claude 4 Sonnet and with competitive quality on tagging tasks. Details and benchmarks are in the inference.net blog post.
Requirements
- NVIDIA GPU (BF16 inference needs roughly 24 GB of VRAM for the 12B weights alone; see the GPU note below)
- CUDA 12.x runtime
- Python 3.10–3.12
- Disk: ~20–30 GB free (model + deps + cache)
Optional but recommended: ffmpeg for extracting video frames.
GPU note. Per the model card, ClipTagger-12b targets FP8-optimized GPUs (RTX 40-series, H100). This tutorial shows BF16 inference in PyTorch for portability; it was tested on a single NVIDIA H200 (SXM5). For smaller VRAM budgets, use quantization or an FP8 engine such as TensorRT-LLM.
Quickstart (PyTorch, BF16, single GPU)
1) Create environment and install CUDA-enabled PyTorch + deps
python3 -m venv .venv
source .venv/bin/activate
pip install torch torchvision torchaudio "transformers>=4.50" accelerate pillow safetensors compressed-tensors
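Before pulling tens of gigabytes of weights, it is worth confirming that the CUDA build of PyTorch is active and that the GPU has headroom for BF16 weights. An optional quick check (nothing here is specific to ClipTagger; save it as, e.g., check_env.py):

import torch
import transformers

# Report library versions and whether a CUDA device is visible.
print("transformers:", transformers.__version__)
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info(0)  # bytes
    print("GPU:", torch.cuda.get_device_name(0))
    print(f"VRAM: {free / 1e9:.1f} GB free / {total / 1e9:.1f} GB total")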
2) Prepare a test image
Use a small JPEG/PNG (≤1 MB, per the model card). This example uses a CC0-licensed picture from magdeleine.co, from this post:
curl -L -o frame.jpg \
https://magdeleine.co/wp-content/uploads/2022/11/43865945842_6f89d901fc_o-1.jpg
If your frames are large, downscale/compress them to stay under ~1 MB, for example:
ffmpeg -y -i input.jpg -vf "scale=1280:-1,setsar=1" -pix_fmt yuv420p -q:v 3 -frames:v 1 -update 1 frame.jpg
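If ffmpeg is not handy, a Pillow-only helper does the same job. This is a convenience sketch: the 1280 px width and the quality steps are arbitrary choices, and the only real target is the ~1 MB bound from the model card.

resize_frame.py:
from io import BytesIO
from PIL import Image

def compress_under_limit(src: str, dst: str, max_bytes: int = 1_000_000, width: int = 1280) -> None:
    # Downscale to the target width, then lower JPEG quality until the file fits.
    img = Image.open(src).convert("RGB")
    if img.width > width:
        ratio = width / img.width
        img = img.resize((width, max(1, round(img.height * ratio))))
    buf = BytesIO()
    for quality in (90, 80, 70, 60, 50):
        buf = BytesIO()
        img.save(buf, format="JPEG", quality=quality)
        if buf.tell() <= max_bytes:
            break
    with open(dst, "wb") as f:
        f.write(buf.getvalue())

if __name__ == "__main__":
    compress_under_limit("input.jpg", "frame.jpg")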
3) Run inference (BF16):
Create run_cliptagger.py:
import json
from pathlib import Path
from PIL import Image
import torch
from transformers import AutoProcessor, AutoModelForCausalLM
MODEL_ID = "inference-net/ClipTagger-12b"
# Prompts copied from the model card (citation link in the post body).
SYSTEM_PROMPT = (
"You are an image annotation API trained to analyze YouTube video keyframes. "
"You will be given instructions on the output format, what to caption, and how to perform your job. "
"Follow those instructions. For descriptions and summaries, provide them directly and do not lead them with "
"'This image shows' or 'This keyframe displays...', just get right into the details."
)
USER_PROMPT = (
"You are an image annotation API trained to analyze YouTube video keyframes. "
"You must respond with a valid JSON object matching the exact structure below.\n\n"
"Your job is to extract detailed factual elements directly visible in the image. Do not speculate or interpret "
"artistic intent, camera focus, or composition. Do not include phrases like 'this appears to be', 'this looks like', "
"or anything about the image itself. Describe what is physically present in the frame, and nothing more.\n\n"
"Return JSON in this structure:\n\n"
"{\n"
" \"description\": \"A detailed, factual account of what is visibly happening (4 sentences max). Only mention concrete elements or actions that are clearly shown. Do not include anything about how the image is styled, shot, or composed. Do not lead the description with something like 'This image shows' or 'this keyframe is...', just get right into the details.\",\n"
" \"objects\": [\"object1 with relevant visual details\", \"object2 with relevant visual details\", ...],\n"
" \"actions\": [\"action1 with participants and context\", \"action2 with participants and context\", ...],\n"
" \"environment\": \"Detailed factual description of the setting and atmosphere based on visible cues (e.g., interior of a classroom with fluorescent lighting, or outdoor forest path with snow-covered trees).\",\n"
" \"content_type\": \"The type of content it is, e.g. 'real-world footage', 'video game', 'animation', 'cartoon', 'CGI', 'VTuber', etc.\",\n"
" \"specific_style\": \"Specific genre, aesthetic, or platform style (e.g., anime, 3D animation, mobile gameplay, vlog, tutorial, news broadcast, etc.)\",\n"
" \"production_quality\": \"Visible production level: e.g., 'professional studio', 'amateur handheld', 'webcam recording', 'TV broadcast', etc.\",\n"
" \"summary\": \"One clear, comprehensive sentence summarizing the visual content of the frame. Like the description, get right to the point.\",\n"
" \"logos\": [\"logo1 with visual description\", \"logo2 with visual description\", ...]\n"
"}\n\n"
"Rules:\n"
"- Be specific and literal. Focus on what is explicitly visible.\n"
"- Do NOT include interpretations of emotion, mood, or narrative unless it's visually explicit.\n"
"- No artistic or cinematic analysis.\n"
"- Always include the language of any text in the image if present as an object, e.g. \"English text\", \"Japanese text\", \"Russian text\", etc.\n"
"- Maximum 10 objects and 5 actions.\n"
"- Return an empty array for 'logos' if none are present.\n"
"- Always output strictly valid JSON with proper escaping.\n"
"- Output only the JSON, no extra text or explanation."
)
def load_model(device: str = "cuda"):
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
MODEL_ID,
        torch_dtype=torch.bfloat16,
device_map="auto",
trust_remote_code=True,
)
return processor, model
def build_prompt(processor):
# Include an image placeholder so the template has 1 image token.
user_content = [
{"type": "image"},
{"type": "text", "text": USER_PROMPT},
]
messages = [
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": user_content},
]
if hasattr(processor, "apply_chat_template"):
return processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Fallback: include a literal <image> token expected by many VLM chat templates.
return f"<system>\n{SYSTEM_PROMPT}\n</system>\n<user>\n<image>\n{USER_PROMPT}\n</user>\n"
def annotate_image(image_path: str, max_new_tokens: int = 1000, temperature: float = 0.1):
device = "cuda" if torch.cuda.is_available() else "cpu"
processor, model = load_model(device)
prompt = build_prompt(processor)
image = Image.open(image_path).convert("RGB")
inputs = processor(images=image, text=prompt, return_tensors="pt")
inputs = {k: (v.to(model.device) if hasattr(v, "to") else v) for k, v in inputs.items()}
with torch.inference_mode():
generated = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
do_sample=(temperature > 0),
temperature=temperature,
)
    # Decode only the newly generated tokens; generate() returns prompt + completion.
    prompt_len = inputs["input_ids"].shape[-1]
    text = processor.batch_decode(generated[:, prompt_len:], skip_special_tokens=True)[0]
    # Try to parse JSON; if the model outputs surrounding text, trim to JSON braces.
try:
start = text.index("{")
end = text.rindex("}") + 1
parsed = json.loads(text[start:end])
print(json.dumps(parsed, ensure_ascii=False, indent=2))
except Exception:
# Fallback: print raw output to help debugging
print(text)
if __name__ == "__main__":
import sys
img = sys.argv[1] if len(sys.argv) > 1 else "frame.jpg"
annotate_image(img)
The prompts are taken verbatim from the model card; see its required prompts section for the full reference.
Then run it:
python3 run_cliptagger.py frame.jpg
You should see the following kind of output:
{
"actions": [],
"content_type": "real-world footage",
"description": "A small village with white buildings is situated in a valley between two large, green mountains. A paved road winds through the landscape, and a small river flows through the bottom of the valley. The peaks of the mountains are obscured by thick, white fog. A thin waterfall is visible on the side of the mountain on the left.",
"environment": "A remote, rural mountain valley with lush green vegetation. The atmosphere is cool and damp, indicated by the thick fog covering the mountain peaks and the overcast sky.",
"logos": [],
"objects": [
"Green mountains",
"White buildings",
"Paved road",
"River",
"Fog",
"Waterfall"
],
"production_quality": "professional studio",
"specific_style": "Landscape photography",
"summary": "A small village sits in a foggy, green mountain valley with a winding road and a river."
}
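Because downstream pipelines usually key off this schema, it helps to validate each response before storing it. Below is a minimal sketch that checks the keys and limits stated in the prompt; the helper name and return convention are my own:

REQUIRED_KEYS = {
    "description": str, "objects": list, "actions": list, "environment": str,
    "content_type": str, "specific_style": str, "production_quality": str,
    "summary": str, "logos": list,
}

def validate_annotation(ann: dict) -> list[str]:
    # Return a list of problems; an empty list means the annotation looks well-formed.
    problems = []
    for key, expected in REQUIRED_KEYS.items():
        if key not in ann:
            problems.append(f"missing key: {key}")
        elif not isinstance(ann[key], expected):
            problems.append(f"{key} should be a {expected.__name__}")
    if isinstance(ann.get("objects"), list) and len(ann["objects"]) > 10:
        problems.append("more than 10 objects")
    if isinstance(ann.get("actions"), list) and len(ann["actions"]) > 5:
        problems.append("more than 5 actions")
    return problems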
Ideas for production use
For batching, extract frames at a steady cadence and pre-resize to keep images under about 1 MB, then run annotations in parallel.
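As a concrete starting point, the sketch below loads the model once and annotates every JPEG in a directory sequentially, writing one JSON line per frame. It reuses load_model and build_prompt from run_cliptagger.py; the file layout is illustrative, and parallelism/micro-batching are left to you.

batch_annotate.py:
import json
from pathlib import Path

import torch
from PIL import Image

from run_cliptagger import load_model, build_prompt

def annotate_dir(frames_dir: str = "frames", out_path: str = "annotations.jsonl") -> None:
    processor, model = load_model()  # load once, reuse for every frame
    prompt = build_prompt(processor)
    with open(out_path, "w", encoding="utf-8") as out:
        for frame in sorted(Path(frames_dir).glob("*.jpg")):
            image = Image.open(frame).convert("RGB")
            inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
            with torch.inference_mode():
                generated = model.generate(**inputs, max_new_tokens=1000, do_sample=True, temperature=0.1)
            new_tokens = generated[:, inputs["input_ids"].shape[-1]:]
            text = processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
            out.write(json.dumps({"frame": frame.name, "raw": text}, ensure_ascii=False) + "\n")

if __name__ == "__main__":
    annotate_dir()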
For scale, add an HTTP/gRPC front end, queue requests, and use GPU workers that micro-batch to maximize utilization. Track latency percentiles, tokens/sec, batch sizes, and GPU utilization.
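For the front-end piece, a minimal HTTP wrapper could look like the sketch below. FastAPI is my choice here (assumes pip install fastapi uvicorn), not something the post prescribes, and it omits queueing and micro-batching.

server.py:
from io import BytesIO

import torch
from fastapi import FastAPI, File, UploadFile
from PIL import Image

from run_cliptagger import load_model, build_prompt

app = FastAPI()
processor, model = load_model()  # load once at startup
prompt = build_prompt(processor)

@app.post("/annotate")
async def annotate(file: UploadFile = File(...)):
    # Accept an uploaded image, run one generation, return the raw model text.
    image = Image.open(BytesIO(await file.read())).convert("RGB")
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
    with torch.inference_mode():
        generated = model.generate(**inputs, max_new_tokens=1000, do_sample=True, temperature=0.1)
    new_tokens = generated[:, inputs["input_ids"].shape[-1]:]
    return {"raw": processor.batch_decode(new_tokens, skip_special_tokens=True)[0]}

Start it with: uvicorn server:app --host 0.0.0.0 --port 8000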
Engines: start with PyTorch for simplicity. Switch to TensorRT-LLM for FP8 throughput where supported. If you prefer not to operate GPUs, use a managed API (e.g., inference.net).
Conclusion
That’s it - run the quickstart on your frames and iterate on the JSON schema for your task. Feedback welcome!
And finally, thanks to the model authors.