📽️Video Prediction arxiv.orgAcademic

Variable-Length Tokenization via Learnable Global Merging for Diffusion Transformers (opens in new tab)

Latent Diffusion Models (LDMs) have become dominant in visual synthesis, but their quality-compute trade-off is largely constrained by the tokenizer's fixed compression ratio. Variable-length tokenizers (VLTs) promise adaptive compression by varying token counts, allowing diffusion models to flexibly balance quality and compute. However, conventional VLTs modulate length by truncating ordered token sequences, which makes token semantics depend o...

Read the original article