Subset Sampling over Joins
arxiv.org·2h


Abstract: Subset sampling (also known as Poisson sampling), where the decision to include any specific element in the sample is made independently of all others, is a fundamental primitive in data analytics, enabling efficient approximation by processing representative subsets rather than massive datasets. While sampling from explicit lists is well understood, modern applications – such as machine learning over relational data – often require sampling from a set defined implicitly by a relational join. In this paper, we study the problem of subset sampling over joins: drawing a random subset from the join results, where each join re…
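
To make the primitive concrete, the following is a minimal sketch of the naive baseline: materialize the join, then make an independent inclusion decision for each result. It is not the method of the paper, which is not described in this excerpt. The helper names `hash_join` and `subset_sample`, the two-column relation layout, and the per-result probability function `prob` are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def hash_join(r, s):
    """Materialize the equi-join of r and s on their first field.

    r and s are lists of (key, value) pairs (a hypothetical layout).
    """
    index = defaultdict(list)
    for key, b in s:
        index[key].append(b)
    # .get avoids inserting empty buckets for keys that appear only in r.
    return [(key, a, b) for key, a in r for b in index.get(key, ())]


def subset_sample(items, prob, rng):
    """Keep each item independently with probability prob(item)."""
    return [x for x in items if rng.random() < prob(x)]


if __name__ == "__main__":
    r = [(1, "a"), (2, "b")]
    s = [(1, "x"), (1, "y"), (3, "z")]
    joined = hash_join(r, s)
    # Assumed uniform inclusion probability of 0.5 for every join result.
    print(subset_sample(joined, lambda t: 0.5, random.Random(0)))
```

This baseline touches every join result, so its cost grows with the full join size even when the sample is small; that is the gap between sampling from an explicit list and sampling from a set defined implicitly by a join.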
