Introducing GIST: The next stage in smart sampling

research.google

Modern machine learning (ML) has unlocked unprecedented performance at the cost of having to process increasingly massive and complex datasets. From large language models (LLMs) to computer vision systems, there’s a common challenge: handling a massive amount of data that’s expensive to process.

This necessitates subset selection: the task of choosing a smaller, representative group of data points from the full dataset to use for training (as opposed to fine-tuning). The question then becomes: how can we be sure this subset contains enough information to train an accurate model?

At NeurIPS 2025, we introduced Greedy Independent Set Thresholding (GIST), a novel algorithm that helps solve this issue by balancing data “diversity” (ensuring the selected data is not redundant) and data “utility” (ensuring the data is relevant and useful for the task). GIST not only outperforms state-of-the-art methods on benchmark tasks, such as image classification, but it does so with a mathematical guarantee on its solution quality.

The conflict: Why smart sampling is hard

When selecting a subset of data, researchers must balance two often conflicting objectives: diversity and utility. Enforcing data diversity ensures the selected points aren’t redundant. Utility measures the overall usefulness or informational value of the selected subset.

For diversity, we focus on maximizing the minimum distance (typically in embedding space) between any two selected data points, also known as max-min diversity. If you choose two data points that are very similar (e.g., two almost identical pictures of a golden retriever), your diversity is low. Max-min diversity forces you to select points that are all as far apart from each other as possible, minimizing redundancy and ensuring broad coverage of the data landscape. For utility, we focus on the class of monotone submodular functions, which aim to maximize the total unique information covered by the subset.
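To make the two objectives concrete, here is a minimal sketch (not the GIST algorithm itself) that computes max-min diversity and a toy coverage-style monotone submodular utility for a candidate subset. The point names, 2-D embedding coordinates, and class labels are invented for illustration.

```python
# Illustrative sketch, not GIST: computing the two objectives from the text
# for a candidate subset of points in a (toy, 2-D) embedding space.
from itertools import combinations
from math import dist

def max_min_diversity(coords):
    """Max-min diversity: the minimum pairwise distance among the selected
    points. Near-duplicate points drag this value toward zero."""
    return min(dist(p, q) for p, q in combinations(coords, 2))

def coverage_utility(names, labels):
    """A toy monotone submodular utility: the number of distinct classes
    the subset covers. Adding a point never decreases it (monotone), and
    its marginal gain shrinks as the subset grows (submodular)."""
    return len({labels[n] for n in names})

# Hypothetical embedding points and class labels.
points = {
    "retriever_1": (0.0, 0.0),
    "retriever_2": (0.1, 0.0),   # near-duplicate of retriever_1
    "tabby_cat":   (5.0, 0.0),
    "sparrow":     (0.0, 5.0),
}
labels = {"retriever_1": "dog", "retriever_2": "dog",
          "tabby_cat": "cat", "sparrow": "bird"}

redundant = ["retriever_1", "retriever_2", "tabby_cat"]
diverse   = ["retriever_1", "tabby_cat", "sparrow"]
coords = lambda names: [points[n] for n in names]

print(max_min_diversity(coords(redundant)))  # small (~0.1): duplicates hurt
print(max_min_diversity(coords(diverse)))    # 5.0: well spread out
print(coverage_utility(diverse, labels))     # 3 distinct classes covered
```

The two near-identical retriever images illustrate the trade-off from the text: they may both score highly on utility, but keeping both collapses the subset's max-min diversity.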

The difficulty lies in combining these two goals. A pure max-min strategy might select diverse but ultimately irrelevant data points, while a pure utility strategy might select a tight, highly relevant cluster of redundant points. Finding a subset that is both maximally spread out and maximally informative is a complex combinatorial problem that is known to be NP-hard, meaning no known algorithm can efficiently find the optimal solution in all cases, a limitation that bites hardest on massive datasets.

This inherent conflict requires a clever approximation strategy.
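To give a feel for what such an approximation strategy can look like, here is a simplified greedy-thresholding heuristic: scan points in decreasing utility order and keep a point only if it sits at least a threshold distance away from everything already selected. This is a generic sketch inspired by the algorithm's name, not the published GIST algorithm, and it carries none of GIST's guarantees; the data, utility scores, and threshold below are invented for illustration.

```python
# Hedged sketch of a greedy distance-thresholding heuristic for the
# diversity/utility trade-off. NOT the actual GIST algorithm.
from math import dist

def greedy_threshold_select(points, utilities, k, tau):
    """Greedily pick up to k points in decreasing-utility order, keeping a
    point only if it is at least `tau` away from every selected point."""
    selected = []
    for name in sorted(points, key=utilities.get, reverse=True):
        if all(dist(points[name], points[other]) >= tau for other in selected):
            selected.append(name)
            if len(selected) == k:
                break
    return selected

# Hypothetical embedding points and per-point utility scores.
points = {
    "retriever_1": (0.0, 0.0),
    "retriever_2": (0.1, 0.0),   # near-duplicate: high utility, no diversity
    "tabby_cat":   (5.0, 0.0),
    "sparrow":     (0.0, 5.0),
}
utilities = {"retriever_1": 0.9, "retriever_2": 0.8,
             "tabby_cat": 0.7, "sparrow": 0.3}

# retriever_2 is skipped (too close to retriever_1), so the lower-utility
# but far-away sparrow makes the cut.
print(greedy_threshold_select(points, utilities, k=3, tau=1.0))
```

The distance threshold `tau` is what arbitrates the conflict: raising it forces more spread (diversity) at the cost of skipping high-utility points, while lowering it lets utility dominate.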

How GIST works
