TL;DR

inference-net/ClipTagger-12b is a Gemma-3-12B-based VLM released under the Apache-2.0 license. With a single GPU you can generate structured JSON annotations for video frames or standalone images. It costs substantially less to run than closed SOTA models such as Claude 4 Sonnet, with competitive quality on tagging tasks. Details and benchmarks are in the inference.net blog post.
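Since the model returns one JSON object per frame, it helps to validate each response before storing it. A minimal sketch (the field names below are illustrative assumptions, not the model's documented schema):

```python
import json

# Illustrative required keys -- an assumption, not the documented ClipTagger schema.
REQUIRED_KEYS = {"description", "objects", "actions"}

def parse_annotation(raw: str) -> dict:
    """Parse a model response and check it is a JSON object with the expected keys."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("annotation must be a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"annotation missing keys: {sorted(missing)}")
    return data

sample = '{"description": "a dog runs on a beach", "objects": ["dog", "beach"], "actions": ["running"]}'
annotation = parse_annotation(sample)
```

Swap `REQUIRED_KEYS` for whatever schema your prompt actually requests from the model.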

Requirements

  • GPU
  • CUDA 12.x runtime
  • Python 3.10–3.12
  • Disk: ~20–30 GB free (model + deps + cache)

Optional but recommended: ffmpeg for extracting video frames.
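A common pattern is to sample frames at a fixed rate before tagging. A hedged sketch that only builds the ffmpeg command line (paths, rate, and output pattern are placeholders, not values from the model card):

```python
from pathlib import Path

def ffmpeg_frame_cmd(video: str, out_dir: str, fps: float = 1.0) -> list[str]:
    """Build an ffmpeg argv that samples `fps` frames per second into numbered JPEGs.

    The caller is responsible for creating `out_dir` and running the command,
    e.g. with subprocess.run(cmd, check=True).
    """
    return [
        "ffmpeg", "-i", video,
        "-vf", f"fps={fps}",   # sample at the requested rate
        "-q:v", "2",           # high JPEG quality
        str(Path(out_dir) / "frame_%06d.jpg"),
    ]

cmd = ffmpeg_frame_cmd("clip.mp4", "frames", fps=0.5)
```

At 0.5 fps this extracts one frame every two seconds, which keeps annotation cost proportional to video length rather than frame count.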

GPU note. Per the model card, ClipTagger-12B targets FP8-optimized GPUs (RTX 40-series, H10…
