Some Fun Videos on Optimizing NES Code

Just some quick links and commentary this week as I shift gears into my next project. The Displaced Gamers YouTube channel has a number of code deep dives into NES games over the years, and two recent pairs of videos have been particularly relevant to my own writing on 6502 coding. These have involved Ghosts ‘n Goblins (Part 1, Part 2) and Metroid (Part 1, Part 2). In each case, Part 1 tries to examine the general structure of frame updates and then determine where one might be able to optimize code to remove lag or stuttering animations. Part 2 then…

One thing that comes up in both is that procedure calls in inner loops are extremely expensive. I discovered that this stays true even into the 16-bit era when optimizing some Motorola code on the Genesis; it’s at least as true on the 6502. Another thing that comes up is that pointers can be very expensive on the 6502. Both Micronics and Nintendo found themselves using 16-bit pointer-based solutions in several places where a more dedicated 8-bit indexing system would have been more effective. When I talked about data processing on the 6502, I set out two principles for efficient 6502 data processing:

Inner loop variables in 6502 code should generally be held in index registers, with memory taking a distant second place and the accumulator being most unsuitable.
6502 code does not want to work with pointers. It wants to work with arrays, and if the process involves accessing multiple arrays at once, it wants to work with corresponding elements in each of those arrays.

DisplacedGamers here suggests that Micronics was optimizing more for programmer time and early releases rather than most efficient use of the hardware, and I do think this shows. This does, however, sell them a bit short; Ghosts ‘n Goblins was doing multidirectional scrolling on the NES in 1986, and really acceptable diagonal scrolling normally doesn’t show up until much, much later in the console’s lifespan. So despite quite a lot of glitchiness in the animation, it’s worth noting that they were purchasing what were at the time some very rare effects. (The only game from its era that I can think of that really manages this well is The Legend of Kage and it’s very obvious that it built its whole engine around just that effect.)

The part of the display engine I found the most baffling, though, was the sprite reversal engine. The overall display tried to run at a reasonably consistent 20 FPS, but sprite priorities were reversed every other frame to automatically flicker sprites when there are more than 8 on a scanline. This is a fairly reasonable approach, normally, but here it results in a bizarre 3:2 polyrhythm between sprite updates and physics updates that makes me wonder if they had originally intended to use these spaces as some kind of double-buffering for sprite data. If so, flickering sprites at 20 FPS would surely have been unacceptable, even at the time, so it’s not too surprising to me that something like this would be patched in.

The remnants of that potentially-a-double-buffering system, though, meant that the code was using pointers nearly everywhere despite restricting itself to a single 256-byte block that didn’t need pointers in the first place. The optimizations go into simplifying that, but even without that full simplification into a single array, the 6502’s “indirect indexed” mode would have let them stick with rapid iteration using Y as the index register had they built it around that.

When doing the sprite reversal, the code has to work with both copies of the RAM copies of the Sprite Attribute Tables; this is an array of 64 4-byte structures. If you look at DG’s optimized replacement code, he’s keeping the X and Y indices the same during the iteration until the end, and accessing each byte in the structure by adding a value from 0 to 3 to its base pointer. Once the whole structure is copied is done, he can shift the pointer by 4 in each direction, and it turned out that bouncing it through the accumulator first was faster than just repeatedly executing DEX and INY instructions.

Metroid‘s game objects were managed in a very similar way to this, but its objects were 16 bytes wide. It relied on subroutines (again, quite expensive given the time constraints) to adjust the X and Y registers as needed. The video there outlines an alternative approach, which effectively divides up memory by field instead of by object. This “structure of arrays” approach is much more pleasant on the 6502, and I’ve talked about it explicitly twice now: once in passing when porting Simulated Evolution to efficiently represent the simulation state, and once more formally when discussing how one would implement linked list data structures on the platform. The potential gains are quite significant.

Another fun thing that came up was some “animation glitches” that turned out to be nothing of the sort, but which instead were a consequence of scrolling the screen at 1.125 pixels per frame while using fixed point math. I talked about the benefits of fixed point here, but this is a caveat I hadn’t really considered at the time. 9/8 is a pretty rough ratio even if the general platforming control ultimately requires it.

Definitely a fun time all around, and I’ve enjoyed the “Behind the Code” series for a long time, too. Worth checking out.

With luck, I’ll be back next week with a new project.

← Author, Director, Performer, Audience

Post navigation

Similar Posts