Last week’s adventures with the TI-99/4A’s Graphics Programming Language revolved around relying on features of the system firmware and language runtime to easily manage expensive or time-critical tasks. This week, I’ll attack the flip side of the question: how far can we go without leaving the GROMs, and how do we accelerate it when we need to?
Last year, when I was getting a handle on the TI hardware and trying to devise a sensible programming discipline for it, one of the programs I put together to do that was a test routine for the 16-bit Xorshift PRNG that I’ve relied on for years. I didn’t really go into it at the time, though, since the code was sloppy and more just a proof-of-concept. So this week, we’ll use it as a platform to explore the boundaries between GROM and ROM, between the GPL bytecode and native machine language.
- We will generalize our original programming discipline to include the GROM.
- We will reimplement the original test program to fit purely in GROM, and then test the speed of the results.
- We will accelerate our program by reimplementing parts of it in native code, and calling out to it from the GROM bytecode.
- Last week’s music routine in pure-GROM code was kind of obnoxious: we had to copy some of our data tables into VRAM to actually use them, and we also had to restructure a bunch of code that otherwise would be unrelated. Hybrid ROM/GROM code should be able to help us here, too, so we’ll bring in the same song and play it back more conveniently than we did last time.
Expanding Our Design Discipline
Here’s the original discipline, mostly quoted from last year’s article:
- The program only owns the first 64 bytes of scratchpad RAM, from >8300–>833F. This means none of our state will be interfered with by the firmware either at the base system level or the GPL bytecode interpreter level. BASIC would still get in our way, but we are not interoperating with BASIC here.
- The Workspace Pointer is always >8300 at function boundaries. The only disjoint values we can set where we get the whole workspace are >8300 and >8320, and since the BLWP overwrites three entire registers on its own, it’s not worth the effort to swap between them with that mechanism. Just rely on LWPI as needed if a system call obliges us to alter the workspace location.
- Variables are either local, global, or far. Local variables are only used by the current function. Global variables live in the workspace but every function has to agree to their purpose. Far variables include everything outside of the >8300–>831F workspace.
- Local variables start at R0 and count up. A function whose signature declares two local variables will destroy registers R0 and R1. One with four destroys R0 through R3. Arguments and return values are transferred as local variables much as they are in other register-based calling conventions.
- Functions are treated as using the local variables of the functions they themselves call. If function FOO uses two locals, and another function BAR calls FOO and then also uses two locals of its own, then BAR will use five locals: its own, the two used by FOO, and a copied-out value of R11 that lets it preserve its own return address. Registers that don’t need to be preserved across function calls might potentially lower this count.
- Global variables start at R15 and count down. Basically this keeps things consistent while offering as much scope as possible for consistent use of local variables. However, we also aren’t completely free here, because some instructions hardcode significance into particular registers. We’ll basically always have to skip R11 because the BL instruction uses it to do function calls. If we are writing to be targeted by the cross-workspace function-call instruction BLWP, we also cannot use R13–R15 because it makes use of all of them itself. Finally, if we are doing direct I/O with peripherals through the “CRU” subsystem, that class of instructions places special meaning on R12. If our program uses neither facility (and ours indeed do not) then these four variables are available for use as globals.
- Far variables, conceptually, live in VRAM. This is probably a part that doesn’t contribute to drawing the screen, but that’s not guaranteed. Quite a few of my earlier projects have done things like treat the score as a string variable stored directly in the screen memory where it is drawn, or taken a framebuffer display showing a chart of some kind and used the view as the model. When I created a simple shooting gallery, I mapped regions of the Sprite Attribute Table into this region to edit them semi-directly.
- The >8320–>833F region serves as a local cache for part of VRAM. This is how we do work with those far variables. Once we’re done with that particular chunk of memory, we can write it back until we need it again.
Most of this holds up very well as-is. We only need to alter or add commentary to a few rules.
- The workspace pointer must be >83E0 when receiving control from, or returning control to, the GROM. This means that we’ll need to be setting the workspace to >8300 by hand more often than we otherwise would, and returning from a global entry point will look like making a syscall.
- Routines in GPL and native code share the local variable space. While the GPL interpreter has a procedure call stack, it doesn’t have a local variable frame, so its functions participate in the local variable allocation protocol the same way native functions do.
- The GROM has freer access to far variables. GPL operands give you much freer access to data in VRAM, which means that it’s less necessary to set up the local cache in CPU RAM. It’s easy to do so, though (reading in or writing out the cache is a single MOVE instruction), so responsibility for the cache is not restricted to either side of the divide.
- Far variables don’t have to use VRAM as an ultimate source of truth. If an application has fewer than 32 bytes of far variables, it can just keep them in the scratchpad and all code will access it identically.
Experiment 1: All-GROM
The test program opens almost identically to the Hello World program in the TI-99/4A Platform Guide:
GROM >6000
AORG >0000
DATA >AA01,0,0,MENU
DATA 0,0,0,0
MENU DATA 0,START
STRI "GROM XORSHIFT TEST"
START ALL >20 * CLEAR SCREEN
BACK >F4
DST >0900,@>834A
CALL >0018 * SET UPPERCASE CHARSET
DST >0B00,@>834A
CALL >004A * SET LOWERCASE CHARSET
ST >F4,V@>0380 * SET COLORS
MOVE 15,V@>0380,V@>0381
MOVE 7,V@>0968,V@>0969 * CHANGE DASH TO UNDERLINE
ST >FF,V@>0969
The main extension here is redefining the colors to be white-on-blue and adjusting the dash character to look more like a solid underline. That done, we print out the static headers that introduce our charts:
FMT
ROW 0
COL 0
HTEX 'FRAME COUNT:'
ROW 2
COL 4
HTEX 'XORSHIFT16 TEST SEQUENCE '
HTEX '------------------------'
ROW 6
COL 2
FEND
Now, 27 lines in, we finally have to actually start thinking about where our variables should go. This requires us to think about the entire program, not just the main function, so I’ll be skipping ahead a bit to sort this out:
- We have four bytes of global variables, holding the PRNG state. That corresponds to R14 and R15 in our program’s workspace, so that’s >831C–>831F.
- The RNG function itself also needs two bytes of local variables.
- We’ll also have a special function to write out 16-bit hex numbers; this routine will require three bytes of workspace.
- All of this means that the main function’s own locals will begin at >8303.
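The arithmetic behind that last point is simple enough to model in a few lines of Python. This is only a sketch of the allocation rule, not code from the cartridge; the function name `locals_base` and the table of byte counts are my own illustrative assumptions.

```python
WORKSPACE = 0x8300  # program workspace base, per the discipline above

def locals_base(callee_bytes):
    """Return the first scratchpad address a caller may use for its own
    locals, given how many bytes of locals each routine it calls needs.
    Every callee starts at the workspace base, so they overlap one
    another and the caller only has to clear the largest of them."""
    return WORKSPACE + max(callee_bytes.values(), default=0)

# Byte counts from the list above; RNGSEED is assumed to need only its
# 16-bit argument at >8300.
main_callees = {"RNGSEED": 2, "RNG": 2, "PHEX": 3}
print(hex(locals_base(main_callees)))  # 0x8303
```

The key design point is the `max`: callees that never call each other can share the same bytes, so the caller pays only for the hungriest one.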
Working within this discipline makes it really important to both have an iron grip on your call stack and to do your actual implementation from the bottom up. Explaining the program afterwards does still make more sense top down, though.
Now that we know where our locals go, we can implement our main loop. First, we need to seed the RNG with the value 1, then generate 64 random numbers and print them out on the screen with some spacing.
DST 1,@>8300
CALL RNGSEED
ST 64,@>8303
! CALL RNG
DST @>831E,@>8300
CALL PHEX
FMT
HTEX ' '
FEND
DEC @>8303
BR G@-!
Notice that the RNG function doesn’t have to actually return anything; the global variable at >831E is quite literally just as easy to access from the GROM as any other point in memory.
The display now built, we now repeatedly print the frame count to the screen as a 16-bit integer and loop until a key is pressed.
This finishes our program. We aren’t dealing with music yet; the changes necessary would look basically identical to where we ended last week.
Implementing the Subroutines
We have three subroutines. The first takes a 16-bit value in >8300 and replicates it across the RNG state:
A more generic version would insist that this value not be zero and make it something else if it was; we don’t have to trouble ourselves with that here.
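For illustration, here is a small Python model of that seeding step, including the zero guard that the "more generic version" would want. It is a sketch rather than the GPL routine itself; the name `rng_seed` and the choice of 1 as the replacement seed are assumptions of mine.

```python
def rng_seed(seed):
    """Replicate a 16-bit seed across a two-word (four-byte) state.
    A zero seed is replaced with 1 (an arbitrary choice for this sketch),
    since an all-zero xorshift state never changes."""
    seed &= 0xFFFF
    if seed == 0:
        seed = 1
    return (seed, seed)  # models the two state words in R14 and R15

print([hex(w) for w in rng_seed(1)])  # ['0x1', '0x1']
print([hex(w) for w in rng_seed(0)])  # ['0x1', '0x1']
```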
The RNG routine itself is almost a direct translation of the two-address pseudocode that I wrote when first implementing this RNG for the ZX81. Only the “push” and “pop” operations are altered.
The PHEX function prints a 16-bit number as 4 digits of hexadecimal, and it does it by isolating each nybble of its argument in >8302 and handing it off to an internal worker function named !PDIGI:
The !PDIGI function itself starts by isolating the nybble and converting it to ASCII:
It then prints it out by treating the variable with the ASCII in it as a one-character string and passing that to the HSTR operator in screen-format mode.
That’s kind of awkward, but we do get our display!
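The nybble handling in PHEX and !PDIGI can be modeled in Python as follows. This is a sketch of the technique only: the function names are mine, and it returns a string instead of going through HSTR the way the GROM code does.

```python
def hex_digit(nybble):
    """Convert the low four bits of a value to an uppercase ASCII hex digit."""
    nybble &= 0x0F
    # 0-9 map to '0'-'9'; 10-15 skip ahead to 'A'-'F'
    return chr(nybble + (0x30 if nybble < 10 else 0x37))

def phex(value):
    """Return a 16-bit value as four hex digits, most significant first."""
    return "".join(hex_digit(value >> shift) for shift in (12, 8, 4, 0))

print(phex(0x1A2F))  # 1A2F
```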
It’s extremely slow, though. We have a noticeable pause on startup as the characters load, and the gap between the BACK command and the VRAM writes to the >0380 range leaves it in a weird state where the screen has a bright cyan background but a blue border.
Worse, it takes over a second to print out this display, from the first letter written to the screen to the first timer value printed at the top. The native-code version that I had made back in my original tests took more like a tenth of a second. We were warned that relying on a bytecode interpreter instead of native code would be slower, but this is an order of magnitude slower for something that we want to do graphics code with.
That won’t do.
Experiment 2: Hybrid Code
Fortunately, GROMs are not all-or-nothing: the Graphics Programming Language includes an “eXecute Machine Language” instruction to call routines in the ROM. That’s right: using XML in your GPL code will speed it up. This platform is just an absolute nomenclature disaster here in 2026.
Happily, the instruction is really easy to use. We can have up to 48 machine language functions exposed to the GROM. The addresses of these functions are put into three 16-entry tables at >6010, >6030, and >7000 in the native-code part of our cartridge (which is mapped into the >6000–>7FFF range in CPU memory). We may then call a function with the instruction XML >xy, where x is 7, 8, or 9 to pick the first, second or third table, and y is a value from 0 to 15 to pick an element from that table. The first and second tables kind of run up against one another here, but the instruction doesn’t really notice that. For using the instruction to index places that are not your own cartridge ROM, the TI-99/4A Tech Pages offer a complete list of values for x.
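To make the table arithmetic concrete, here is a Python sketch that maps an XML operand to the CPU address of the table slot it selects. It covers only the three cartridge tables named above, and the name `xml_vector_address` is my own.

```python
# Table bases for the three cartridge-ROM XML tables named above.
XML_TABLES = {0x7: 0x6010, 0x8: 0x6030, 0x9: 0x7000}

def xml_vector_address(operand):
    """Return the CPU address of the table slot that XML >xy reads.
    Each slot holds a 16-bit routine address, so slots are 2 bytes apart."""
    table, index = operand >> 4, operand & 0x0F
    return XML_TABLES[table] + 2 * index

print(hex(xml_vector_address(0x71)))  # 0x6012
print(hex(xml_vector_address(0x7F)))  # 0x602e, the last slot before >6030
```

Sixteen two-byte slots occupy exactly >20 bytes, which is why the first table ends right where the second begins.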
Once you’re on the machine language side of things, your workspace pointer will be >83E0, your return address will be at R11 in that workspace, and interrupts will be disabled. We may move the workspace pointer around, but we must restore it before we return. Having interrupts disabled works in our favor, because it gives us free access to the VRAM ports without any extra work on our part.
Our task here will be pushing the RNG and PHEX functions into native code. (RNGSEED is simple enough that it’s not worth the effort of porting it over.) That means our initial table looks like this:
The API is surprisingly similar: we’ll call RNG with the instruction XML >71, collecting the result from >831E like before, and PHEX with XML >72, with the argument passed in >8300 again like before.
The RNG function’s logic isn’t much changed from the GROM. It’s just about ten times faster:
The PHEX and !PDIGI functions have more work to do now. We’ll be writing to VRAM to output our characters, so not only do we need to do the work for VRAM access, we also need to interoperate properly with the FMT commands. FMT’s cursor position is stored with the row in >837E and the column in >837F; since the screen is 32 columns wide, we need to compute >4000 + (ROW * 32) + COL and then send it low byte first to the VDP command port at >8C02.
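As a sanity check on that address arithmetic, here is a Python sketch that turns a cursor position into the two bytes sent to the VDP. It assumes the standard 32-column screen with the screen image table at VRAM >0000, so each row advances the address by 32; the function name is hypothetical.

```python
def vdp_write_setup(row, col):
    """Return the two bytes to send to the VDP address port (>8C02),
    in order, to begin writing at the given screen position.
    Assumes a 32-column screen whose image table starts at VRAM >0000."""
    address = 0x4000 + row * 32 + col      # >4000 marks the access as a write
    return (address & 0xFF, address >> 8)  # low byte first, then high

print([hex(b) for b in vdp_write_setup(6, 2)])  # ['0xc2', '0x40']
```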
Simple enough, but it wrecks our GROM code. We’ve already used six bytes of local variables instead of the three the GROM edition did. I don’t really want to keep editing the GROM as I experiment with native code, so as part of this conversion I also eradicate all GROM use of local variables and relocate all the variables it uses into the >8320–>833F far variable region.
Isolating each nybble and calling !PDIGI is deceptively simple, looking very much like our GROM code but without the fallthrough:
MOV R0,R1
SRL R0,4
BL @!PDIGI
MOV R1,R0
BL @!PDIGI
SWPB R1
MOV R1,R0
SRL R0,4
BL @!PDIGI
MOV R1,R0
BL @!PDIGI
This would normally be a huge problem: we’re calling other functions but we never saved our link register! However, the link register that called us here is back in the GPL Workspace at >83E0 and is safe. We only need to manually preserve the link register if our call stack gets three deep here.
Our last task is to update the cursor on the way out:
AB @C04,@>837F * Advance CCOL
CB @>837F,@C20
JL !
SB @C20,@>837F
AB @C01,@>837E
! LWPI >83E0
B *R11
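The cursor update is easier to follow with the constants spelled out. This Python sketch mirrors the logic of the listing above, on the assumption that C04, C20, and C01 are byte constants holding 4, 32, and 1; the function name and parameter names are mine.

```python
def advance_cursor(row, col, width=4, columns=32):
    """Advance the FMT cursor past a printed field, wrapping to the next
    row when the column runs off the right edge of the screen."""
    col += width               # AB @C04,@>837F
    if col >= columns:         # CB / JL: skip the wrap while col < 32
        col -= columns         # SB @C20,@>837F
        row += 1               # AB @C01,@>837E
    return row, col

print(advance_cursor(6, 2))   # (6, 6)
print(advance_cursor(6, 28))  # (7, 0)
```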
The !PDIGI function is roughly the same as the GROM original.
Finally, we need to add the constant tables that we used because the increment and immediate-mode instructions do not work on bytes.
The output is identical, but we’re doing it in more like a sixth of a second. Not as good as full-bore native code, but even with the logic that hands control between the native code’s direct VRAM access and the GROM’s use of the FMT display language, we’re far closer to fully-native speeds than we were with the pure GROM approach.
Not bad at all, if I may say so. But hybrid code also lets us attack headaches beyond execution speed.
Experiment 3: A Custom Interrupt Service Routine
One of the tasks we took on last week was to add a little soundtrack to the animation we’d built in pure-GROM code. It had some unpleasantness associated with it; we had to restructure all our code to manage a frame update loop that we otherwise didn’t need. We also found that we had to copy some of our data out of GROM into VRAM because the GROM bytecode language didn’t seem to let us dereference pointers into the GROM.
Native code can solve both of these issues. It’s more work to do things in native code because we have to manage all the cross-address-space logic ourselves, but that also means that we’re unrestricted in what we can ask for. Furthermore, the system frame-interrupt routine that managed the sound lists and sprite animation also includes a hook that lets us call out to a user-defined native machine code routine. We can put the music update code there, which lets the GROM side operate freely.
We’ll need a few new global variables to track the music players: R13 will hold the current pointer into the song, and R12 will hold the loop point. Once we reach the end of the song, the loop point will tell us how to reset it.
The GROM code only needs to know how to ask to set this up, so there’s only one new function we export: DOSONG, at entry point >73.
This code looks like it should be doable from GPL; all we’re doing is copying >8300 into >8318 and >831A then loading a constant into the interrupt service routine pointer at >83C4. I don’t trust the GROM to do this, though; the bytecode interpreter is fundamentally byte-oriented and it does not appear to run the interpreter with interrupts disabled. We could only have half of the address loaded at the time an interrupt hit, and since the value won’t be zero, we’ll jump off into some random memory location and everything will explode. The XML instruction guarantees that interrupts are disabled, and the MOV instruction is 16-bit too, so we’re doubly safe.
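The hazard is easy to see in a toy model. The Python sketch below lists the values a 16-bit hook pointer would hold if it were rewritten one byte at a time; the high-byte-first order and the handler address >6123 are assumptions chosen purely for illustration.

```python
def byte_writes(old, new):
    """Yield each value a 16-bit cell holds while it is rewritten one
    byte at a time, high byte first (an assumed order for this sketch)."""
    half = (new & 0xFF00) | (old & 0x00FF)
    yield half   # an interrupt arriving here sees a half-written pointer
    yield new

HANDLER = 0x6123  # hypothetical ISR address, for illustration only
print([hex(v) for v in byte_writes(0x0000, HANDLER)])  # ['0x6100', '0x6123']
```

The intermediate value is nonzero, so the firmware would treat it as a valid hook and jump to the wrong place. A single 16-bit store never exposes that state.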
We haven’t really considered the interrupt handlers in our code yet, but that’s because it mostly does a good job of staying out of our way. The firmware interrupt handlers do their work with the workspace pointer at >83C0, well out of the way of everything we’ve worked with. If it sees that there’s a hook for a custom interrupt service routine, it sets the workspace pointer to >83E0 and then calls it with a branch-and-link instruction.
A Brief Moment of Panic
To repeat the previous point: the interrupt handler sets the workspace pointer to >83E0 and then uses a branch-and-link instruction. This puts the return address in register 11, and that means that if an interrupt hits while we’re processing an XML routine, it’s going to overwrite our own return address and we will never be returning into the GROM. This is why interrupts are always disabled when we enter XML routines; to protect us from this danger.
We’ll need to expand our memory usage discipline if we are going to encompass custom interrupts.
- All variables touched by an interrupt service routine are treated as globals. It definitely can’t touch our locals, and since interrupts can happen at any time, it shouldn’t rely on the far variables or their cache being in any particular state.
- If an XML routine enables interrupts, it must preserve the 16-bit value at >83F6 in a local and restore it after re-disabling interrupts. This will safely permit control to return to the GROM even though the ISR and the routine share a return address location.
For our part, we’ll just keep not enabling interrupts at all; none of our machine code calls are expected to last more than a frame. The general principle that we need to leave the system as we found it, though, will continue to hold firm.
Implementing the Handler
Our interrupt service routine doesn’t have anything to do unless the sound list has been exhausted, so we can save ourselves a lot of time and trouble if we just check that before doing literally anything else:
The rest of the logic here will match our GROM code from last time, just modified to not need VRAM and to ensure that we properly clean up after ourselves:
- Save the current GROM read address.
- Load the current song pointer (which DOSONG put into R13) and make that the new GROM read address.
- Load two bytes from GROM to get the next sound list address. If it’s zero, copy the loop point (R12) into the song pointer and repeat the previous step.
- Increment the song pointer by two.
- Load the next sound list address into the sound list pointer and reactivate the sound list playback system.
- Restore the original GROM read address.
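The middle steps of that list can be modeled in Python as a pure function over a GROM image. This is a sketch under stated assumptions: song entries are taken to be big-endian 16-bit words, the saving and restoring of the GROM read address is left out, and unlike the real loop it reads at most twice instead of spinning on a zero loop point. The names are mine.

```python
def next_sound_list(grom, song_ptr, loop_ptr):
    """Return (sound_list_address, new_song_ptr) for one step of the song.
    grom is a bytes-like image indexed by GROM address; a zero entry
    marks the end of the song."""
    entry = (grom[song_ptr] << 8) | grom[song_ptr + 1]
    if entry == 0:
        # End of song: rewind to the loop point and read again.
        song_ptr = loop_ptr
        entry = (grom[song_ptr] << 8) | grom[song_ptr + 1]
    return entry, song_ptr + 2

# A toy song at offset 0: two sound list addresses, then the terminator.
song = bytes([0x61, 0x00, 0x62, 0x00, 0x00, 0x00])
ptr = 0
for _ in range(3):
    addr, ptr = next_sound_list(song, ptr, loop_ptr=0)
    print(hex(addr), ptr)
```

The third call hits the terminator, rewinds to the loop point, and hands back the first sound list again.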
This took me a couple of iterations to get to where I was happy with it. We only really need to allocate one additional global variable here: the previous GROM read address from step 1, which we will put in R10. Everything else can be done via direct memory access.
Reading the GROM address is done a byte at a time out of >9802. The data comes out in big-endian order, and we have to stall an instruction between accesses. Due to internal caching in the GROM circuitry, the value we get is actually one past the value that is on deck to be read, so we’ll need to decrement the result to get the proper value:
LWPI >8300
MOVB @>9802,R10
NOP
MOVB @>9802,@>8315 * LOW BYTE OF R10
DEC R10 * CORRECT VALUE
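Here is the same read-and-correct sequence as a Python sketch, with a stand-in for the hardware port so it can be run anywhere. The `fake_port` helper is entirely hypothetical and models nothing about the GROM beyond the off-by-one described above.

```python
def read_grom_address(read_port):
    """Recover the GROM's current read address from two successive reads
    of the address port (>9802 in the text): high byte first, then low,
    then subtract one to undo the prefetch."""
    high = read_port()
    low = read_port()
    return (((high << 8) | low) - 1) & 0xFFFF

def fake_port(true_address):
    """Hypothetical stand-in for the hardware: reports true_address + 1,
    one byte per read, high byte first."""
    reported = (true_address + 1) & 0xFFFF
    data = iter([reported >> 8, reported & 0xFF])
    return lambda: next(data)

print(hex(read_grom_address(fake_port(0x60FF))))  # 0x60ff
```

The example address is chosen so that the increment carries into the high byte, which is the case where correcting only the low byte would go wrong.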
Setting the new GROM address is done a byte at a time, big-endian, into >9C02, very like VRAM. We then pull our two bytes out via the port at >9800 and feed them directly into the sound list pointer. If we find we’ve read a zero, we reset R13 and loop back.
(Obviously we will not wish to set the loop point to a zero word or we’ll hang forever.)
Now that the sound list pointer is properly loaded, we may restart playback, restore the GROM and Workspace pointers to where they were, and hand control back to the firmware interrupt handler.
Putting It All Together
With that interrupt routine in place, and the new DOSONG entry point, we can add music playback to our original GROM code with two lines during initialization:
DST SONGPAT,@>8300
XML >73
With that done, the GROM is free to continue on its own, and there are no changes needed to the structure at all. Mission accomplished!
What We Didn’t Do
Everything I’ve talked about here basically keeps the GROM in overall control of the application. It’s fairly clear to me that this is how the architecture was designed; while the Editor/Assembler package included a GPLLNK utility to call subroutines in the GROM from assembly language, the actual mechanics of this technique involve doing stuff that’s complicated enough that I’m comfortable calling it “tricking the firmware into letting you do this.” The TI-99/4A Tech Pages have a page dedicated to all sorts of ways to manage calls and returns into the GROM and the interested reader is directed there; I’ll just try to summarize it here. Calling is relatively easy: edit the bytecode interpreter’s internal state so that the program counter is aimed at the routine you want, then jump into the interpreter itself. Returning is harder; the basic idea is to put a suitable return address onto the interpreter’s callstack before you transfer control to it, but that return address is in the GROM. If we also control the GROM code (like we have here, in a hybrid cartridge), that’s mostly fine; we can have a dedicated return function in the GROM with the instruction XML >F0, which will use >8300–>831F as the jump table, so our return address will now be at R0 in our usual call mechanisms! We load our CPU-level return address there before the call, and once we are called as “a subroutine”, we simply never relinquish control.
I think this is kind of a cop-out. If you can do this, you are providing your own GROM code and could just let the GROM code run the top-level show anyway. The only reason to want to do this is because you’re a pure-ROM cartridge or a program loaded from disk and CPU memory is the only thing you can edit. In this case you are obliged to essentially carry out a return-oriented programming attack on the system firmware itself to find some data bytes that look like a suitable XML call. If you’re disk software, you can at least rely on memory expansions being installed, which gives you more options for return address tables; unfortunately, you probably can’t rely on a particular version of the firmware being installed, which makes the whole thing very fragile.
Overall, I think it’s more trouble than it’s worth. It’s the 21st Century. The default form factor for new software will be “virtual cartridge image.” When transferring to real hardware, flash carts will cheerfully simulate not only GROM but writable GRAM. Creating hybrid applications where the GROM handles top-level orchestration affords a much cleaner and more sensible ABI. Sometimes, it just really is OK to take the path of least resistance.