Morsel-driven parallelism: a NUMA-aware query evaluation framework for the many-core age Viktor Leis, Peter Boncz, Alfons Kemper, and Thomas Neumann SIGMOD’14
The giant upon whose shoulders this paper rests is Volcano. Parallelism in Volcano is achieved through a proper separation of concerns. Volcano contains many database operators, most of which are blissfully unaware of parallelism. A handful of operators in a query plan exist only to enable parallelism (for example, an operator could implement pipeline parallelism, or partition data between threads).
Generally speaking, an elegant separation of concerns is a good thing for a system's design. The thesis of morsel-driven parallelism, however, is that it is not good for performance: deeply integrating the notion of parallel execution into each operator is the way to go for OLAP.
The system described in this paper is named HyPer. Fig. 2 below illustrates how HyPer decomposes a single query into three pipelines:
The leftmost pipeline scans the input relation T and applies a filter to it. The middle pipeline scans and filters the input relation S. The rightmost pipeline scans and filters R, joins the result with the filtered S (the join on attribute A), and finally joins the result of that first join with the filtered T (the join on attribute B). A system like Volcano would be tempted to scan T and S in parallel. Not so with HyPer: the pipelines that make up a query plan are executed serially.
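As a minimal sketch of that control flow (the names here are mine, not HyPer's API): pipelines run one after another, and all parallelism lives inside whichever pipeline is currently executing.

```cpp
// Sketch only: pipelines run serially; each pipeline internally fans its
// morsels out to all worker threads (see the dispatcher sketch below).
#include <functional>
#include <vector>

using Pipeline = std::function<void()>;

void execute_query(const std::vector<Pipeline>& pipelines) {
    // For the example query: build HT(T), then build HT(S), then the probe
    // pipeline over R -- one after another, each using every core internally.
    for (const auto& run_fully_parallel : pipelines) {
        run_fully_parallel();
    }
}
```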
Each relation (both base inputs and temporary data) is divided into morsels. A morsel is a group of ~100,000 tuples. Each morsel resides on a single NUMA node (indicated by colors in the figures). Fig. 3 illustrates how HyPer uses morsel-level parallelism to implement the left pipeline (scan and filter T):
The pipeline that processes T operates in two phases. In the first phase, a pool of threads (each pinned to a core) collectively processes all morsels in T. When a thread comes up for air, it grabs another morsel of input. Each thread is assigned to a NUMA node and prefers to process morsels that reside on that node. If no morsels of matching color are available, a thread will reluctantly process a morsel from another NUMA node. During the first phase, each thread writes filtered tuples into a thread-local storage area.
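Here is a minimal sketch of that first phase, under my own assumptions about types and names (and with a mutex where HyPer's real dispatcher is lock-free): a shared dispatcher hands out morsels, each worker asks for morsels from its own NUMA node first, and survivors of the filter land in a per-thread buffer.

```cpp
// Sketch of phase one of the T pipeline (scan + filter), assuming hypothetical
// Morsel/Tuple types. HyPer's real dispatcher is lock-free; the mutex just
// keeps the sketch short.
#include <deque>
#include <mutex>
#include <optional>
#include <vector>

struct Tuple { /* column values */ };
struct Morsel { std::vector<Tuple> tuples; };  // ~100,000 tuples on one NUMA node

class Dispatcher {
    std::vector<std::deque<Morsel*>> per_node_;  // one morsel queue per NUMA node
    std::mutex mutex_;
public:
    explicit Dispatcher(size_t numa_nodes) : per_node_(numa_nodes) {}

    void add(Morsel* m, size_t node) { per_node_[node].push_back(m); }

    // Prefer a morsel on the caller's NUMA node; otherwise steal one.
    std::optional<Morsel*> next(size_t my_node) {
        std::lock_guard<std::mutex> lock(mutex_);
        for (size_t i = 0; i < per_node_.size(); ++i) {
            auto& queue = per_node_[(my_node + i) % per_node_.size()];
            if (!queue.empty()) {
                Morsel* m = queue.front();
                queue.pop_front();
                return m;
            }
        }
        return std::nullopt;  // the pipeline's input is exhausted
    }
};

bool passes_filter(const Tuple&) { return true; }  // stand-in for the real predicate

// Each worker thread (pinned to a core on `my_node`) runs this loop and
// appends surviving tuples to its own thread-local storage area.
void scan_and_filter(Dispatcher& dispatcher, size_t my_node,
                     std::vector<Tuple>& thread_local_out) {
    while (auto morsel = dispatcher.next(my_node)) {
        for (const Tuple& t : (*morsel)->tuples) {
            if (passes_filter(t)) thread_local_out.push_back(t);
        }
    }
}
```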
Once all input tuples have been processed, a hash table is created (conveniently, the hash table can be sized well because the number of tuples that must be stored is known). This hash table is global (i.e., not NUMA aware). In the second phase, the filtered tuples from the thread-local storage areas are inserted into the hash table.
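A small sketch of the sizing step, with the caveat that the "next power of two, at least twice the tuple count" policy is my assumption rather than a detail taken from the paper; the insertions themselves use the lock-free scheme described next.

```cpp
// Sketch: once every thread-local buffer is full, the exact build-side size is
// known, so the hash-table directory can be allocated at its final size up
// front and never needs to grow during insertion.
#include <atomic>
#include <cstdint>
#include <vector>

std::vector<std::atomic<uint64_t>> make_directory(
        const std::vector<size_t>& per_thread_tuple_counts) {
    size_t total = 0;
    for (size_t count : per_thread_tuple_counts) total += count;

    size_t buckets = 1;
    while (buckets < 2 * total) buckets <<= 1;  // power of two => cheap masking

    // One 64-bit word per bucket (48-bit pointer + 16-bit tag), initially zero.
    return std::vector<std::atomic<uint64_t>>(buckets);
}
```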
The HyPer hash table is designed to allow lock-free insertions, along with a fast path for the common case where a probe operation yields no hits. The hash table uses chaining, with tagged pointers linking the elements of each chain. The lower 48 bits of each pointer in the hash table contain the memory address being pointed at; the upper 16 bits are a tag. Think of the tag as a 16-bit Bloom filter describing the set of elements in the sub-list that the pointer points to. When the hash table is probed, a hash of the probe tuple's join key is used both to determine which chain to search and to stop the search early when no unexamined list element could possibly contain the join key.
Because both the pointer and the tag are packed into 64 bits, a compare-and-swap instruction can be used to insert elements into the hash table without using an OS lock. If the hash table is large enough, then most executions of the compare-and-swap instruction should succeed. Fig. 7 illustrates the hash table data structure and the insertion algorithm:
It is a bit odd that the design in this paper goes to such great lengths to avoid cross-socket (NUMA) reads from main memory, and yet the hash table is not NUMA aware. I think that the 16-bit tags are key here. If the set of head pointers for all buckets in the hash table is small enough to fit into an L2 cache, then this data can be efficiently replicated into all L2 caches. As long as the hash table hit rate is low enough, the number of cross-socket memory accesses during probe operations will be low.
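To make the tagged-pointer scheme concrete, here is a minimal C++ sketch of the bucket layout, the CAS-based insert, and the early-exit probe. The 48/16-bit split follows the description above; everything else (the hash function, how the tag bit is picked, the names) is my own stand-in rather than HyPer's actual code.

```cpp
// Sketch of the tagged-pointer hash table: each bucket head is one 64-bit word
// whose low 48 bits point at the first chain entry and whose high 16 bits act
// as a tiny Bloom filter over the keys in that chain. The directory size must
// be a power of two (see the sizing sketch above).
#include <atomic>
#include <cstdint>
#include <vector>

struct Entry {
    uint64_t key;
    Entry*   next;   // chain link within a bucket
    // ... payload ...
};

static constexpr uint64_t kPtrMask = (1ull << 48) - 1;  // x86-64 user addresses fit here

static uint64_t hash_key(uint64_t key) { return key * 0x9E3779B97F4A7C15ull; }

// Pick one of the 16 tag bits (bits 48..63) from the top of the hash.
static uint64_t tag_bit(uint64_t h) { return 1ull << (48 + (h >> 60)); }

struct TaggedHashTable {
    std::vector<std::atomic<uint64_t>> buckets;   // zero-initialized directory

    explicit TaggedHashTable(size_t n) : buckets(n) {}

    // Lock-free insert: splice the entry at the head of its chain and OR the
    // key's tag bit into the head word, retrying the CAS if another thread won.
    void insert(Entry* e) {
        uint64_t h = hash_key(e->key);
        std::atomic<uint64_t>& slot = buckets[h & (buckets.size() - 1)];
        uint64_t old = slot.load(std::memory_order_relaxed);
        uint64_t updated;
        do {
            e->next = reinterpret_cast<Entry*>(old & kPtrMask);
            updated = (reinterpret_cast<uint64_t>(e) & kPtrMask)  // new head pointer
                      | (old & ~kPtrMask)                         // existing tag bits
                      | tag_bit(h);                               // plus this key's bit
        } while (!slot.compare_exchange_weak(old, updated));
    }

    // Probe: if the key's tag bit is not set, no element of the chain can match,
    // so the (possibly remote) chain traversal is skipped entirely.
    Entry* probe(uint64_t key) const {
        uint64_t h = hash_key(key);
        uint64_t head = buckets[h & (buckets.size() - 1)].load();
        if ((head & tag_bit(h)) == 0) return nullptr;   // fast negative path
        for (Entry* e = reinterpret_cast<Entry*>(head & kPtrMask); e; e = e->next)
            if (e->key == key) return e;
        return nullptr;
    }
};
```

Note that folding the filter into the head word makes the negative-probe check essentially free: the probe has to load that word anyway to find the chain.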
Fig. 11 shows throughput for all 22 TPC-H queries, for 4 different configurations:
It is interesting how NUMA-awareness matters a lot for some queries but not for others.
Fig. 10 shows the authors' NUMA model:
What is interesting here is that a similar setup exists within each socket. Say a socket contains 32 cores and 4 memory controllers. Those cores and memory controllers will be laid out in a grid, with a network connecting them. I wonder if there is performance to be gained by paying attention to the intra-socket layout (e.g., cores on the left side of the chip should only access memory controllers on the left side of the chip).