UPP: Universal Predicate Pushdown to Smart Storage Ipoom Jeong, Jinghan Huang, Chuxuan Hu, Dohyun Park, Jaeyoung Kang, Nam Sung Kim, and Yongjoo Park ISCA’25
Working on hardware acceleration requires a healthy dose of honesty. If you try hard enough, you can find clever ways to accelerate a given application. Once that point is reached, it is helpful to step back and ask yourself: “are these ideas generally applicable to many hardware architectures, or only to the one I am targeting?”
This paper describes techniques for high-performance filtering of OLAP data, and an FPGA implementation. I wonder if these ideas would also work well on other chips.
This paper rests on two assumptions:
Inputs are relatively static, which means that the cost of preprocessing can be amortized over many queries.
Best-effort filtering is OK, because the system has another level of filtering to catch any false positives (rows which should have been removed, but were not)
A preprocessing step generates a 256-bit row vector (RV) associated with each row. These bits are partitioned among all columns (e.g., if there are 8 columns in the relation, then each column is represented with 32 bits per row). When a query is run, the relevant filters from the query are converted into a set of 256-bit query vectors (QVs) and simple instructions which perform logical operations between the row vectors and query vectors. The result of those instructions is a single bit per row which determines if the row can be safely removed.
Numerical expressions (e.g., l_quantity >= 20 and l_quantity <= 30) are supported for monotone functions. During the preprocessing step, the lower and upper bounds of each column are computed. This space is divided into a fixed number of buckets. For each value in a column, the associated bucket index is computed, and the associated bit in the row vector is set to 1. Hashing is used to handle the case where there are more buckets than bits allocated for a given column.
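A minimal sketch of that preprocessing step (all names and the 8-bit budget are my assumptions, not the paper's). The column's range is split into buckets; each value sets the bit of its bucket, and when there are more buckets than bits allotted to the column, the bucket index is hashed down into the bit budget:

```python
BITS = 8  # bits allotted to this column in the 256-bit row vector (assumed)

def bucket_index(value, lo, hi, num_buckets):
    """Map a value to a bucket over the column's [lo, hi] range."""
    width = (hi - lo) / num_buckets
    # Clamp so that hi itself falls into the last bucket.
    return min(int((value - lo) / width), num_buckets - 1)

def rv_bits(value, lo, hi, num_buckets, bits=BITS):
    idx = bucket_index(value, lo, hi, num_buckets)
    if num_buckets > bits:
        idx = hash(idx) % bits  # hash excess buckets into the bit budget
    return 1 << idx

column = [3, 17, 22, 48]
lo, hi = min(column), max(column)
row_bits = [rv_bits(v, lo, hi, num_buckets=16) for v in column]
```

Note that hashing bucket indices introduces collisions: two distant value ranges can share a bit, which is fine under the best-effort assumption since the second-level filter catches the resulting false positives.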
When a query is executed, the software can determine the set of buckets which the query references. For example, say the filter expression is:
l_quantity >= 20 and l_quantity <= 30
and the buckets for l_quantity are:
[0, 10), [10, 20), [20, 30), [30, 40), [40, 50).
The query vector which selects rows which should not be filtered is (LSB first):
00110
To determine if a row should be filtered, compute the bitwise AND of the row vector and the query vector. If all bits of the result are zero, then the row can be removed.
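The worked example above can be sketched in a few lines (function names are mine; the real system does this with bitwise instructions over 256-bit vectors):

```python
# Buckets for l_quantity: [0,10), [10,20), [20,30), [30,40), [40,50)
# Query: l_quantity >= 20 AND l_quantity <= 30 touches buckets 2 and 3,
# so the query vector (written LSB first) is 00110, i.e. 0b01100.
QV = 0b01100

def row_vector(l_quantity):
    # One bucket per 10 units; clamp into the last bucket.
    return 1 << min(l_quantity // 10, 4)

def survives(rv, qv=QV):
    # A row can be safely removed iff RV AND QV is all zeros.
    return (rv & qv) != 0

assert survives(row_vector(25))      # in [20,30): kept
assert survives(row_vector(35))      # in [30,40): kept -- a false positive
                                     # (35 > 30) for the next filter stage
assert not survives(row_vector(15))  # in [10,20): removed
```

The `row_vector(35)` case shows why this is best-effort: the bucket [30, 40) overlaps the query range at the single value 30, so every row in that bucket survives this stage.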
To convert a string into a row vector, the paper proposes tokenizing the string, and then hashing each token (i.e., word) to determine which bit in the row vector to set. This means that multiple bits in a row vector can be set (one bit per word). Only tokens which appear frequently in the dataset are hashed; the rest are ignored.
A query expression like l_shipinstruct = 'DELIVER IN PERSON' is decomposed into three tokens, and each token is hashed, and the hash values determine which bits in the query vector are set. Rows are accepted if they have all 3 bits set. Note that this is best-effort filtering. For example, if a row contains the string 'PERSON DELIVER IN' in the l_shipinstruct column, that row will not be removed.
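A sketch of the string path, under my own assumptions (CRC32 as the hash, a 32-bit budget for the column, and every token hashed rather than only the frequent ones):

```python
import zlib

BITS = 32  # bits allotted to the string column (assumed)

def token_bits(text):
    """OR together one bit per whitespace-separated token."""
    bits = 0
    for token in text.split():
        bits |= 1 << (zlib.crc32(token.encode()) % BITS)
    return bits

qv = token_bits('DELIVER IN PERSON')

# Exact match: all query bits present, row kept.
assert token_bits('DELIVER IN PERSON') & qv == qv
# Reordered tokens set the same bits, so this false positive is kept too.
assert token_bits('PERSON DELIVER IN') & qv == qv
# An unrelated string is filtered unless its tokens happen to collide
# with all of the query's bits.
filtered = token_bits('TAKE BACK RETURN') & qv != qv
```

Because the row vector is just a set of bits, word order and token multiplicity are lost by construction, which is exactly why the 'PERSON DELIVER IN' row survives this stage and must be caught by the exact filter downstream.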
Table 3 shows FPGA resource usage numbers for an FPGA accelerator which executes queries by performing bitwise operations on row and query vectors, and compacting tables based on the generated bitmaps. These numbers seem pretty modest (i.e., good) to me. The authors argue that this implementation is small enough that it could be incorporated into a Smart SSD, allowing queries to be pushed down as far as possible.
Fig. 6 shows TPC-H performance results. Each pair of bars represents a particular query run on a baseline system, and on a system with filtering pushed down to a Smart SSD. Q21 doesn’t see a speedup because it is bound by join and aggregation, not filtering.
I wonder how much of this is overfitting to TPC-H. If you look at the TPC-H spec, a lot of string columns are generated by randomly sampling from a very small set of possible tokens. It would be great if the industry had a “held out test set” which could be used to evaluate OLAP performance on real-world yet hidden datasets which researchers could not directly see.