Abstract
Modern processors implement a decoupled front-end, often using a form of Fetch Directed Instruction Prefetching (FDIP), to avoid front-end stalls. FDIP is driven by the Branch Prediction Unit (BPU), relying on the BPU’s accuracy and branch target tracking structures to speculatively fetch instructions into the Instruction Cache (L1-I cache). As contemporary data center applications become more complex, their code footprints also grow, resulting in a high number of Branch Target Buffer (BTB) misses. These BTB missing branches typically have previously been decoded and placed in the BTB, but have since been evicted, leading to BTB misses now. FDIP can alleviate L1-I cache misses, but its reliance on the BPU’s tracking structures means that when it encounters a BTB mi…
Abstract
Modern processors implement a decoupled front-end, often using a form of Fetch Directed Instruction Prefetching (FDIP), to avoid front-end stalls. FDIP is driven by the Branch Prediction Unit (BPU), relying on the BPU’s accuracy and branch target tracking structures to speculatively fetch instructions into the Instruction Cache (L1-I cache). As contemporary data center applications become more complex, their code footprints also grow, resulting in a high number of Branch Target Buffer (BTB) misses. These BTB missing branches typically have previously been decoded and placed in the BTB, but have since been evicted, leading to BTB misses now. FDIP can alleviate L1-I cache misses, but its reliance on the BPU’s tracking structures means that when it encounters a BTB miss, the BPU may not identify the current instruction as a branch to FDIP. This can prevent FDIP from prefetching or cause it to speculate down the wrong path, further polluting the L1-I cache.
We observe that the vast majority, 75%, of BTB-missing, unidentified branches are actually present in instruction cache lines that FDIP has previously fetched. Nevertheless, these missing branches have not yet been decoded and inserted into the BTB. This is because the instruction line is decoded from an entry point (which is the target of the previous taken branch) till an exit point (taken branch). We call branch instructions present in the ignored portion of the cache line ’‘Shadow Branches.’’ Here we present Skia, a novel shadow branch decoding technique that identifies and decodes unused bytes in cache lines fetched by FDIP, inserting them into a Shadow Branch Buffer (SBB). The SBB is accessed in parallel with the BTB, allowing FDIP to speculate despite a BTB miss.
With a minimal storage state of 12.25KB, Skia delivers a geomean speedup of ~5.7% over an 8K-entry BTB (78KB) and ~2% versus adding an equal amount of state to the BTB, across 16 front-end bound applications. Since many branches stored in the SBB are distinct compared to those in a similarly sized BTB, we consistently observe greater performance gains with Skia across all examined sizes until saturation.
Formats available
You can view the full content in the following formats:
References
[1]
Apache cassandra. https://cassandra.apache.org.
[2]
Apache kafka. https://kafka.apache.org.
[3]
Apache tomcat. https://tomcat.apache.org.
[4]
Browserbench. https://browserbench.org.
[5]
Dotty scala compiler. https://github.com/lampepfl/dotty.
[6]
Intel vtune profiler. https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html#gs.hk7unc.
[7]
Postgresql. https://www.postgresql.org.
[8]
Speedometer 2.0. https://browserbench.org/Speedometer2.0.
[9]
Tpc-c. http://www.tpc.org/tpcc.
[10]
Twitter finagle. https://twitter.github.io/finagle.
[11]
Verilator. https://www.veripool.org/wiki/verilator.
[12]
Wikichip. https://en.wikichip.org/wiki/intel/microarchitectures/golden_cove.
[13]
Ycsb. https://github.com/brianfrankcooper/YCSB.
[14]
Alon Amid, David Biancolin, Abraham Gonzalez, Daniel Grubb, Sagar Karandikar, Harrison Liew, Albert Magyar, Howard Mao, Albert Ou, Nathan Pemberton, Paul Rigge, Colin Schmidt, John Wright, Jerry Zhao, Yakun Sophia Shao, Krste Asanović, and Borivoje Nikolić. Chipyard: Integrated design, simulation, and implementation framework for custom socs. IEEE Micro, 40(4):10–21, 2020.
[15]
Ali Ansari, Pejman Lotfi-Kamran, and Hamid Sarbazi-Azad. Divide and conquer frontend bottleneck. In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), pages 65–78. IEEE, 2020.
[16]
Grant Ayers, Nayana Prasad Nagendra, David I August, Hyoun Kyu Cho, Svilen Kanev, Christos Kozyrakis, Trivikram Krishnamurthy, Heiner Litz, Tipp Moseley, and Parthasarathy Ranganathan. Asmdb: understanding and mitigating front-end stalls in warehouse-scale computers. In Proceedings of the 46th International Symposium on Computer Architecture, pages 462–473, 2019.
[17]
Rajeev Balasubramonian, Andrew B. Kahng, Naveen Muralimanohar, Ali Shafiee, and Vaishnav Srinivas. Cacti 7: New tools for interconnect exploration in innovative off-chip memories. ACM Trans. Archit. Code Optim., 14(2), jun 2017.
[18]
Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay Vaish, Mark D. Hill, and David A. Wood. The gem5 simulator. SIGARCH Comput. Archit. News, 39(2):1–7, aug 2011.
[19]
Stephen M. Blackburn, Robin Garner, Chris Hoffmann, Asjad M. Khang, Kathryn S. McKinley, Rotem Bentzur, Amer Diwan, Daniel Feinberg, Daniel Frampton, Samuel Z. Guyer, Martin Hirzel, Antony Hosking, Maria Jump, Han Lee, J. Eliot B. Moss, Aashish Phansalkar, Darko Stefanović, Thomas VanDrunen, Daniel von Dincklage, and Ben Wiedermann. The dacapo benchmarks: Java benchmarking development and analysis. In Proceedings of the 21st Annual ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications, OOPSLA ’06, page 169–190, New York, NY, USA, 2006. Association for Computing Machinery.
[20]
Gino Chacon, Nathan Gober, Krishnendra Nathella, Paul V. Gratz, and Daniel A. Jiménez. A characterization of the effects of software instruction prefetching on an aggressive front-end. In 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 61–70, 2023.
[21]
Thomas M Conte, Kishore N Menezes, Patrick M Mills, and Burzin A Patel. Optimization of instruction fetch mechanisms for high issue rates. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 333–344, 1995.
[22]
Djellel Eddine Difallah, Andrew Pavlo, Carlo Curino, and Philippe Cudre-Mauroux. Oltp-bench: An extensible testbed for benchmarking relational databases. Proc. VLDB Endow., 7(4):277–288, dec 2013.
[23]
Bhargav Reddy Godala, Sankara Prasad Ramesh, Gilles A Pokam, Jared Stark, Andre Seznec, Dean Tullsen, and David I August. PDIP: Priority directed instruction prefetching. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pages 846–861. ACM, 2024.
[24]
Brian Grayson, Jeff Rupley, Gerald Zuraski Zuraski, Eric Quinnell, Daniel A. Jiménez, Tarun Nakra, Paul Kitchin, Ryan Hensley, Edward Brekelbaum, Vikas Sinha, and Ankit Ghiya. Evolution of the samsung exynos cpu microarchitecture. In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), pages 40–51, 2020.
[25]
Yasuo Ishii, Jaekyu Lee, Krishnendra Nathella, and Dam Sunwoo. Rebasing instruction prefetching: An industry perspective. IEEE Computer Architecture Letters, 19(2):147–150, 2020.
[26]
Svilen Kanev, Juan Pablo Darago, Kim Hazelwood, Parthasarathy Ranganathan, Tipp Moseley, Gu-Yeon Wei, and David Brooks. Profiling a warehouse-scale computer. In Proceedings of the 42nd Annual International Symposium on Computer Architecture, pages 158–169, 2015.
[27]
Cansu Kaynak, Boris Grot, and Babak Falsafi. Confluence: unified instruction supply for scale-out servers. In Proceedings of the 48th International Symposium on Microarchitecture, pages 166–177, 2015.
[28]
Tanvir Ahmed Khan, Nathan Brown, Akshitha Sriraman, Niranjan K Soundararajan, Rakesh Kumar, Joseph Devietti, Sreenivas Subramoney, Gilles A Pokam, Heiner Litz, and Baris Kasikci. Twig: Profile-guided btb prefetching for data center applications. In MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, pages 816–829, 2021.
[29]
Tanvir Ahmed Khan, Akshitha Sriraman, Joseph Devietti, Gilles Pokam, Heiner Litz, and Baris Kasikci. I-spy: Context-driven conditional instruction prefetching with coalescing. In 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 146–159. IEEE, 2020.
[30]
Tanvir Ahmed Khan, Muhammed Ugur, Krishnendra Nathella, Dam Sunwoo, Heiner Litz, Daniel A Jiménez, and Baris Kasikci. Whisper: Profile-guided branch misprediction elimination for data center applications. In 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 19–34. IEEE, 2022.
[31]
Tanvir Ahmed Khan, Dexin Zhang, Akshitha Sriraman, Joseph Devietti, Gilles Pokam, Heiner Litz, and Baris Kasikci. Ripple: Profile-guided instruction cache replacement for data center applications. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), pages 734–747, 2021.
[32]
Aasheesh Kolli, Ali Saidi, and Thomas FWenisch. Rdip: Return-addressstack directed instruction prefetching. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, pages 260–271, 2013.
[33]
Rakesh Kumar, Boris Grot, and Vijay Nagarajan. Blasting through the front-end bottleneck with shotgun. ACM SIGPLAN Notices, 53(2):30–42, 2018.
[34]
Rakesh Kumar, Cheng-Chieh Huang, Boris Grot, and Vijay Nagarajan. Boomerang: A metadata-free architecture for control flow delivery. In 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 493–504. IEEE, 2017.
[35]
Maksim Panchenko, Rafael Auler, Bill Nell, and Guilherme Ottoni. Bolt: A practical binary optimizer for data centers and beyond. In Proceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization, CGO 2019, page 2–14. IEEE Press, 2019.
[36]
Andrea Pellegrini, Nigel Stephens, Magnus Bruce, Yasuo Ishii, Joseph Pusdesris, Abhishek Raja, Chris Abernathy, Jinson Koppanalil, Tushar Ringe, Ashok Tummala, Jamshed Jalal, Mark Werkheiser, and Anitha Kona. The arm neoverse n1 platform: Building blocks for the next-gen cloud-to-edge infrastructure soc. IEEE Micro, 40(2):53–62, 2020.
[37]
Jim Pierce and Trevor Mudge. Wrong-path instruction prefetching. In Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO 29, pages 165–175. IEEE, 1996.
[38]
Aleksandar Prokopec, Andrea Rosà, David Leopoldseder, Gilles Duboscq, Petr T?ma, Martin Studener, Lubomír Bulej, Yudi Zheng, Alex Villazón, Doug Simon, Thomas Würthinger, and Walter Binder. Renaissance: Benchmarking suite for parallel applications on the jvm. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2019, page 31–47, New York, NY, USA, 2019. Association for Computing Machinery.
[39]
G. Reinman, B. Calder, and T. Austin. Fetch directed instruction prefetching. In MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture, pages 16–27, 1999.
[40]
Alberto Ros and Alexandra Jimborean. Wrong-path-aware entangling instruction prefetcher. IEEE Transactions on Computers, 2023.
[41]
J Rupley. Samsung exynos m3 processor. IEEE Hot Chips, 30, 2018.
[42]
David Schall, Artemiy Margaritov, Dmitrii Ustiugov, Andreas Sandberg, and Boris Grot. Lukewarm serverless functions: characterization and optimization. In Proceedings of the 49th Annual International Symposium on Computer Architecture, pages 757–770, 2022.
[43]
André Seznec. A 64-kbytes ittage indirect branch predictor. In JWAC-2: Championship Branch Prediction, 2011.
[44]
André Seznec. Tage-sc-l branch predictors again. In 5th JILP Workshop on Computer Architecture Competitions (JWAC-5): Championship Branch Prediction (CBP-5), 2016.
[45]
Shixin Song, Tanvir Ahmed Khan, Sara Mahdizadeh Shahri, Akshitha Sriraman, Niranjan K Soundararajan, Sreenivas Subramoney, Daniel A Jiménez, Heiner Litz, and Baris Kasikci. Thermometer: profile-guided btb replacement for data center applications. In Proceedings of the 49th Annual International Symposium on Computer Architecture, pages 742–756, 2022.
[46]
David Suggs, Mahesh Subramony, and Dan Bouvier. The amd “zen 2” processor. IEEE Micro, 40(2):45–52, 2020.