Visualizing Attention in Text-to-Image Diffusion

In the previous post we took a text-to-image diffusion model apart and looked at its three working parts: a CLIP text encoder that turns…