Modern machine learning (ML) has unlocked unprecedented performance, but at the cost of processing increasingly massive and complex datasets. From large language models (LLMs) to computer vision systems, a common challenge emerges: the sheer volume of training data is expensive to process.
This necessitates subset selection — the task of choosing a smaller, representative group of data points from the full dataset for training from scratch (as opposed to fine-tuning). The question then becomes: how can we be sure this subset contains enough information to train an accurate model?
At NeurIPS 2025, we introduced Greedy Independent Set Thresholding (GIST), a novel algorithm that addresses this problem by balancing data “diversity” (ensuring the selected data is not redundant) and data “utility” (ensuring the data is relevant and useful for the task). GIST not only outperforms state-of-the-art baselines on benchmark tasks, such as image classification, but it does so with a mathematical guarantee on its solution quality.
The conflict: Why smart sampling is hard
When selecting a subset of data, researchers must balance two often conflicting objectives: diversity and utility. Enforcing data diversity ensures the selected points aren’t redundant. Utility measures the overall usefulness or informational value of the selected subset.
For diversity, we focus on maximizing the minimum distance (typically in embedding space) between any two selected data points, also known as max-min diversity. If you choose two data points that are very similar (e.g., two almost identical pictures of a golden retriever), your diversity is low. Max-min diversity forces you to select points that are all as far apart from each other as possible, minimizing redundancy and ensuring broad coverage of the data landscape. For utility, we focus on the class of monotone submodular functions, which aim to maximize the total unique information covered by the subset.
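The two objectives above can be made concrete with a small sketch. Below, max-min diversity is the smallest pairwise distance among selected embeddings, and the utility is a simple coverage function (count of unique items covered), which is one standard example of a monotone submodular function. The toy embeddings and helper names are illustrative, not from our actual pipeline.

```python
import numpy as np

def max_min_diversity(points: np.ndarray) -> float:
    """Smallest pairwise Euclidean distance among the selected points."""
    n = len(points)
    dists = [np.linalg.norm(points[i] - points[j])
             for i in range(n) for j in range(i + 1, n)]
    return min(dists)

def coverage_utility(selected_sets: list) -> int:
    """A monotone submodular utility: number of unique items covered."""
    covered = set()
    for s in selected_sets:
        covered |= s
    return len(covered)

# A subset with two near-duplicate points scores poorly on diversity...
tight = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
# ...while well-separated points score highly.
spread = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
print(max_min_diversity(tight))   # 0.1 (the near-duplicate pair)
print(max_min_diversity(spread))  # 5.0

# Overlapping sets contribute diminishing marginal coverage:
print(coverage_utility([{1, 2}, {2, 3}]))  # 3, not 4
```

Note how the second set `{2, 3}` adds only one new item given `{1, 2}` — this diminishing-returns property is exactly what makes coverage submodular.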
The difficulty lies in combining these two goals. A pure max-min strategy might select diverse but ultimately irrelevant data points, while a pure utility strategy might select a tight, highly relevant cluster of redundant points. Finding a subset that is both maximally spread out and maximally informative is a complex combinatorial problem that is known to be NP-hard, meaning no efficient algorithm is known for finding the optimal solution — a barrier that only grows more prohibitive at the scale of massive datasets.
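One natural way to couple the two objectives is a thresholded greedy heuristic: greedily pick the point with the highest marginal utility, but only among candidates that stay at least a distance `tau` from everything already chosen. The sketch below is a simplified illustration of that idea, not the actual GIST algorithm or its guarantee; the function names and toy data are hypothetical.

```python
import numpy as np

def thresholded_greedy(points, utility_gain, k, tau):
    """Pick up to k points, each maximizing marginal utility among
    candidates at distance >= tau from every already-selected point."""
    selected = []
    remaining = list(range(len(points)))
    while remaining and len(selected) < k:
        # Drop candidates too close to the current selection.
        remaining = [i for i in remaining
                     if all(np.linalg.norm(points[i] - points[j]) >= tau
                            for j in selected)]
        if not remaining:
            break
        best = max(remaining, key=lambda i: utility_gain(i, selected))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy example: points 0 and 1 are near-duplicates with the same label.
pts = np.array([[0.0, 0.0], [0.05, 0.0], [3.0, 0.0], [0.0, 3.0]])
labels = [{"a"}, {"a"}, {"b"}, {"c"}]

def gain(i, sel):
    """Marginal coverage gain: new labels point i adds to the selection."""
    covered = set().union(*(labels[j] for j in sel)) if sel else set()
    return len(labels[i] - covered)

print(thresholded_greedy(pts, gain, k=3, tau=1.0))  # [0, 2, 3]
```

The near-duplicate (index 1) is filtered out by the distance threshold, so the budget is spent on points that cover new labels instead — the behavior a combined diversity-plus-utility objective is after.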
This inherent conflict requires a clever approximation strategy.