Advent of Small ML: Day 2: Teaching a VLM to reason about charts with Unsupervised GRPO
A big use case for VLMs is parsing chart data for Q&A. CharXiv from @zwcolin is a great recent benchmark for this, but I had a question: can we do this in an unsupervised way?
If we don't need labeled Q/A pairs for every chart, we can leverage data much more cheaply.
The inspiration came from CycleGAN and the idea of using a numerical loss as a proxy for how "good" the text generated by the VLM actually is. (Big inspo here is @rosmine_b's SVG work - go check it out).
The Experiment: I set up a loop to treat the VLM like an autoencoder (sketched in code after the list):
1. Take a chart image.
2. Prompt the VLM to describe it.
3. Feed that description into an image generator (Flux Schnell).
4. Measure the cosine similarity between the regenerated image and the original (using DINO embeddings).
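Here's a minimal sketch of steps 3 and 4, assuming the Hugging Face checkpoints black-forest-labs/FLUX.1-schnell and facebook/dinov2-base; the exact checkpoints, prompt handling, and generation settings are my guesses, not pulled from the repo:

```python
import torch
import torch.nn.functional as F
from diffusers import FluxPipeline
from transformers import AutoImageProcessor, AutoModel

device = "cuda"

# Step 3: regenerate an image from the VLM's description with Flux Schnell
# (few-step distilled model, so no classifier-free guidance needed).
flux = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
).to(device)

# Step 4: embed both images with DINOv2 and compare.
dino_proc = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
dino = AutoModel.from_pretrained("facebook/dinov2-base").to(device).eval()

@torch.no_grad()
def dino_embed(image):
    """CLS-token embedding of a PIL image under DINOv2."""
    inputs = dino_proc(images=image, return_tensors="pt").to(device)
    return dino(**inputs).last_hidden_state[:, 0]  # (1, hidden_dim)

@torch.no_grad()
def cycle_reward(chart_image, description):
    """Cosine similarity between the original chart and the image
    Flux regenerates from the VLM's description of it."""
    regen = flux(description, num_inference_steps=4, guidance_scale=0.0).images[0]
    return F.cosine_similarity(dino_embed(chart_image), dino_embed(regen)).item()
```

Comparing DINO embeddings rather than raw pixels means the reward cares about what the chart shows, not whether Flux reproduces the exact rendering.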
This similarity score becomes the reward signal for GRPO. The logic: to recreate the image accurately, the model has to capture the chart's most salient features in its description.
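A hypothetical wiring of that score into GRPO using TRL's GRPOTrainer; the dataset layout, the "image" column name, and the group size are my assumptions, not taken from the repo:

```python
from trl import GRPOConfig, GRPOTrainer

def cycle_consistency_reward(completions, image, **kwargs):
    # TRL passes extra dataset columns (here: the original chart under an
    # assumed "image" column) to reward functions as keyword arguments.
    return [cycle_reward(img, desc) for img, desc in zip(image, completions)]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-VL-3B-Instruct",  # assumed exact checkpoint
    reward_funcs=cycle_consistency_reward,
    args=GRPOConfig(num_generations=8),   # completions sampled per prompt (GRPO group size)
    train_dataset=chart_dataset,          # hypothetical: describe-this-chart prompts + images, no labels
)
trainer.train()
```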
The methods: I used Qwen 2.5 3B as the VLM and DINOv2 for the embeddings (to capture semantic info, not just pixels).
Results for the Proxy Task: The model consistently improved its cosine similarity scores.
Results for Transfer Learning: Despite seeing zero labeled questions during training, this transferred to CharXiv reasoning questions, showing a ~7% improvement in pass@1 at the peak.
It's a small experiment with a small model, but I think the result is really cool: the model got better at reasoning without seeing a single reasoning label.
I'm really interested in exploring more of these CycleGAN-esque / "LLM as Autoencoder" domains to escape the need for labeled data.
Repo + Plots in the comments.
Results: for the evaluation set, the cosine similarity between the original chart and the image regenerated from the description sent to flux-schnell - it is definitely learning!