Google engineer Rahman Lavaee today announced a prototype software implementation that automatically inserts optimal code prefetches into binaries for faster performance, targeting in particular the new code prefetching instructions on the latest Intel Granite Rapids and AMD Turin processors.
This automatic code prefetch insertion prototype is built atop LLVM's Propeller framework, which was also originally started by Google. Propeller paired with the likes of AutoFDO has proven very useful for tapping additional performance out of binaries.
This new work leverages Propeller to insert optimal code prefetches into binaries. In turn, Google engineers found the software led to a reduction in CPU front-end stalls and an overall improvement in performance for their unnamed internal workloads running on Intel Xeon 6 Granite Rapids.
“We have developed a prototype that leverages Propeller to insert optimal code prefetches into binaries. This innovation is particularly relevant given that new architectures from Intel (GNR) and AMD (Turin) now support software-based code prefetching (PREFETCHIT0/1), a capability that Arm has supported for even longer (PRFM). Our preliminary results demonstrate a reduction in frontend stalls and an overall improvement for an internal workload running on Intel GNR.
Our current framework requires an extra round of hardware profiles on top of the Propeller-optimized binaries. The profile is used to guide the target and injection site selection. Prefetches must be inserted judiciously as over-prefetching may increase the instruction working set. We have seen improvements from injecting ~10k prefetches. About ~80% of prefetches are placed in .text.hot while the rest are in .text. Similarly, 90% of prefetches target .text.hot while the remaining 10% target code in .text.”
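To give a rough feel for the profile-guided selection step described above, here is a minimal Python sketch: rank call edges by sampled execution count and keep only the hottest ones, capped by a budget so that over-prefetching does not inflate the instruction working set. All names, thresholds, and data structures here are hypothetical illustrations, not Propeller's actual implementation.

```python
# Hypothetical sketch of profile-guided prefetch-site selection.
# Ranks call edges by profile count and keeps the hottest ones,
# capped by a budget to avoid over-prefetching.

def select_prefetch_sites(edge_counts, budget=10_000, min_count=100):
    """Return the (injection_site, target) edges worth a code prefetch.

    edge_counts: dict mapping (injection_site, target) -> profile count
    budget:      hard cap on total inserted prefetches
    min_count:   ignore edges too cold to matter
    """
    hot = [(edge, n) for edge, n in edge_counts.items() if n >= min_count]
    hot.sort(key=lambda item: item[1], reverse=True)  # hottest first
    return [edge for edge, _ in hot[:budget]]

# Tiny worked example with made-up profile counts:
profile = {
    ("caller_a", "parse_request"): 9_500,
    ("caller_b", "parse_request"): 4_200,
    ("caller_c", "log_debug"): 40,  # below min_count, skipped
}
print(select_prefetch_sites(profile, budget=2))
# -> [('caller_a', 'parse_request'), ('caller_b', 'parse_request')]
```

The real framework works on hardware profiles of the Propeller-optimized binary and must also pick where in the instruction stream to inject each prefetch; this sketch only captures the idea of budgeted, count-ranked selection.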
Those wanting to learn more about this prototype work can see the LLVM Discourse thread where the Google engineers have posted a request for comments on the idea and approach. It will be interesting to see where this leads and the resulting performance impact on public workloads once the code is published.