ClipTagger-12B VLM: Frame Captioning Tutorial
dev.to · 1d

TL;DR

inference-net/ClipTagger-12b is a Gemma-3-12B-based VLM released under the Apache-2.0 license. On a single GPU it can generate structured JSON annotations for video frames or still images. It is substantially cheaper to run than closed SOTA models such as Claude 4 Sonnet, with competitive quality on tagging tasks. Details and benchmarks are in the inference.net blog post.
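To make "structured JSON annotations" concrete, here is a minimal sketch of building a request payload for an OpenAI-compatible chat endpoint (e.g. one served by vLLM). The prompt text and output keys (`description`, `objects`, etc.) are illustrative assumptions, not the model card's official schema; check the model card for the exact prompts it was trained on.

```python
import base64


def build_request(image_path: str,
                  model: str = "inference-net/ClipTagger-12b") -> dict:
    """Build an OpenAI-style chat payload with one image attached."""
    # Encode the frame as base64 so it can travel inside a JSON body.
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 # Hypothetical prompt: the real system/user prompts are
                 # specified on the model card and should be used verbatim.
                 "text": ("Describe this frame as JSON with keys: "
                          "description, objects, actions, environment.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        # Many inference servers accept this hint to force valid JSON output.
        "response_format": {"type": "json_object"},
    }
```

The payload can then be POSTed to the server's `/v1/chat/completions` route with any HTTP client.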

Requirements

  • NVIDIA GPU (see the GPU note below)
  • CUDA 12.x runtime
  • Python 3.10–3.12
  • Disk: ~20–30 GB free (model + deps + cache)
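Before downloading the weights, it can save time to sanity-check the environment against the list above. A small sketch (the 30 GB threshold mirrors the upper end of the disk estimate; the `nvidia-smi` probe is just a cheap proxy for a working NVIDIA driver):

```python
import shutil
import sys


def check_env(min_free_gb: int = 30) -> list[str]:
    """Return a list of problems; empty means the requirements look met."""
    problems = []
    # Python 3.10-3.12, per the requirements list.
    if not (3, 10) <= sys.version_info[:2] <= (3, 12):
        problems.append(f"Python {sys.version_info[:2]} outside 3.10-3.12")
    # ~20-30 GB free for model weights, dependencies, and cache.
    free_gb = shutil.disk_usage(".").free // 10**9
    if free_gb < min_free_gb:
        problems.append(f"only {free_gb} GB free, want ~{min_free_gb} GB")
    # nvidia-smi on PATH is a rough signal that an NVIDIA driver is present.
    if shutil.which("nvidia-smi") is None:
        problems.append("nvidia-smi not found (no NVIDIA driver?)")
    return problems
```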

Optional but recommended: ffmpeg for extracting video frames.
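For the ffmpeg step, a sketch of sampling frames at a fixed rate via ffmpeg's `fps` video filter, wrapped in Python so it composes with the rest of a pipeline (file names and the default 1 frame/sec rate are illustrative choices):

```python
import shutil
import subprocess
from pathlib import Path


def extract_frames(video: str, out_dir: str, fps: float = 1.0) -> list[str]:
    """Sample `fps` frames per second from `video` into numbered JPEGs."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    cmd = [
        "ffmpeg", "-i", video,
        "-vf", f"fps={fps}",          # fps filter: frames sampled per second
        f"{out_dir}/frame_%05d.jpg",  # zero-padded output pattern
    ]
    # Only invoke ffmpeg when it is installed and the input exists.
    if shutil.which("ffmpeg") and Path(video).exists():
        subprocess.run(cmd, check=True)
    return cmd
```

The equivalent one-liner is `ffmpeg -i clip.mp4 -vf fps=1 frames/frame_%05d.jpg`; lowering `fps` (e.g. `fps=0.5`) halves the number of frames to caption, which directly cuts inference cost.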

GPU note. Per the model card, ClipTagger-12B targets FP8-optimized GPUs (RTX 40-series, H10…
