Dissecting a C64 Autoboot Program

Most computers that supported disk drives also supported running programs off of the disk drive directly—the on-board firmware would check the disk drive as part of system startup, and then transfer control to a “boot disk” if it determined that one was present. This not only made it much easier to upgrade your operating system—just boot off a different disk—but meant that games could take complete control of the system as part of startup, shipping with their own bespoke disk control systems.

The Commodore 64 was not one of these systems. Its firmware always booted you to the BASIC prompt, and you needed to navigate BASIC at least a bit in order to load whatever disk program you hoped to run. For the most part this meant memorizing the famous LOAD "*",8,1 command and then typing RUN once the load had finished. However, some programs from that era didn’t require you to actually type the RUN command at the end; instead the program would start automatically once the load completed!

One technique for accomplishing this feat was published by Dan Carmichael in the November 1984 issue of COMPUTE!’s Gazette and later collected in COMPUTE!’s Third Book of Commodore 64. Like most of the magazine’s advanced programs, it was published as a listing that was mostly a pile of numbers to be POKEd into place by a BASIC loader of some kind. I reverse-engineered COMPUTE!’s automatic proofreader programs a while back, so let’s take a look at this autoloader.

Unlike the proofreaders, the autoloader came with an article in front that explained the basics of the technique, so that part would have been no mystery to a reader in the 1980s. However, the actual loader program itself is opaque, and there are still some things to learn from examining those programs in detail.

Background

I’ve written before at some length on the interaction between machine language programs and the C64’s BASIC runtime. That material, alongside the tables and descriptions in Sheldon Leemon’s classic Mapping the Commodore 64, provides some context for the technique:

The C64’s BASIC is a Microsoft BASIC descendant, and its implementation is relatively flexible about where exactly in memory all its data structures live.
In particular, there are three pointers that divide up the memory used by BASIC, named TXTTAB ($2B), VARTAB ($2D), and ARYTAB ($2F), that point to the start of the program text, the scalar variable space, and the array variable space, in that order; each ends where the next begins.
The LOAD command works with both BASIC and machine language programs. When loading a BASIC program, the program is loaded wherever TXTTAB points; when loading a machine language program (this is what the trailing ,1 in ,8,1 declares) the load address is instead taken from the program itself.
No matter what kind of program is loaded, the VARTAB and ARYTAB variables are set to the address where the load ended. For machine language programs this often deeply confuses the BASIC system and causes normal commands to fail with spurious out-of-memory errors. The NEW command, while ostensibly there to erase any program currently in memory, actually works by fixing the values of the VARTAB and ARYTAB variables to be consistent with a 0-byte BASIC program.
BASIC, and the KERNAL firmware alongside it that serves as the C64’s core operating system, vector a lot of their core functionality through function pointers and parameters in the $0300-$0333 address range in RAM. The most common thing to override here is the IRQ handler so that programs may define their own video interrupts, but the system offers considerably more flexibility.
Programs with components from multiple files (say, a BASIC program with machine language support libraries and a data overlay) often used a dedicated loader program to load all the files and then transfer control to them. This involved printing out the necessary LOAD, NEW, SYS, and/or RUN commands, placing the cursor at the top of them, and then stuffing the keyboard buffer with sufficient RETURN keys to force all those statements to execute.
Control of the keyboard queue is pretty straightforward: there’s a ten-byte buffer of pending keystrokes at KEYD ($0277), and the value at NDX ($C6) holds how many keystrokes are still waiting to be processed.
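
As a rough illustration of the keyboard-queue mechanics in the last item, here is a small Python model. It is only a sketch: the stuff_keys helper and the flat bytearray standing in for RAM are invented for illustration, while the two addresses and the ten-byte limit come from the description above.

```python
KEYD = 0x0277      # start of the ten-byte keyboard buffer
NDX = 0x00C6       # count of keystrokes waiting in that buffer
KEYD_SIZE = 10

def stuff_keys(ram: bytearray, keys: bytes) -> None:
    """Queue keystrokes as if the user had typed them (hypothetical helper)."""
    if len(keys) > KEYD_SIZE:
        raise ValueError("the keyboard buffer only holds ten keystrokes")
    ram[KEYD:KEYD + len(keys)] = keys   # keystrokes go in, oldest first
    ram[NDX] = len(keys)                # tell the system how many are pending

ram = bytearray(0x10000)          # toy stand-in for the C64's 64KB of RAM
stuff_keys(ram, b"\r\r\r")        # three RETURN presses for a multi-step loader
```

A dedicated loader of the kind described above would print its commands first and only then queue the RETURN keys, so that each RETURN lands on a line the screen editor can execute.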

The Basic Technique

Carmichael’s technique relies on a small bit of luck in the C64’s layout: there are a few bytes of memory that the system does not use just before the vector overrides. This 89-byte region runs from $02A7 through $02FF.

The core idea here is to load our bootstrap program into this region, and then deliberately overflow it so that our loaded program smashes BASIC’s operation vectors. In particular, the “BASIC Warm Start Vector” at $0302 is executed just after printing the prompt to return control to the user for the next command; this vector is hijacked to instead point to the start of the program itself. 89 bytes turns out to be just enough to load and run a secondary program using the techniques in my old article, and his code does just that. For more sophisticated program layouts, he suggested using the autoloader to start another dedicated loading program that has the room to actually do the necessary work; such a program would generally be a dozen or so lines of BASIC and based on my own earlier work this path strikes me as obviously viable.

So far, though, this is just a summary of the article itself. The article was a bit more coy about exactly how the loader did its work beyond the general notion of “hijack the warm start vector, then load and run the program you really wanted.” Exactly how to do that turns out to be meaningfully different for BASIC and machine-code programs, so we’ll look at the loader code that the generator created for each. I’ve disassembled both loaders and then turned the parameters and magic constants into symbolic labels or named parameters so that it’s easier to follow what’s going on.

The Loader Stub

Many parts of the task are the same for both BASIC and machine language, though. The loader is a Commodore PRG file, so it has to start with its own load address, which for us is the start of that unused memory region:

.word   $02a7
.org    $02a7

In order to carry out the technique in the article, we also have to end by padding the file until we hit the vector block at $0300, replicate the default value of the first vector, and then put in ourselves for the second:

.advance $0300
.word   $e38b,$02a7
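
The resulting file layout can be sketched outside the assembler as well. The Python below is a minimal model, not the published generator: the build_prg name and the zero filler byte are assumptions made for illustration, while the addresses and the two vector values come from the listing above.

```python
import struct

LOAD_ADDR = 0x02A7             # first byte of the unused region
VECTORS = 0x0300               # start of the vector block
FIRST_VECTOR_DEFAULT = 0xE38B  # default value, replicated unchanged

def build_prg(code: bytes, pad: int = 0x00) -> bytes:
    """Lay out a loader image: load address, code, filler, two vectors."""
    room = VECTORS - LOAD_ADDR                     # 89 bytes
    if len(code) > room:
        raise ValueError(f"code is {len(code)} bytes; only {room} fit")
    filler = bytes([pad]) * (room - len(code))     # assumed filler value
    header = struct.pack("<H", LOAD_ADDR)          # PRG files lead with this
    vectors = struct.pack("<HH", FIRST_VECTOR_DEFAULT, LOAD_ADDR)
    return header + code + filler + vectors

prg = build_prg(b"\x60")       # placeholder body: a lone RTS opcode
```

Whatever the body contains, the file is always the same size, because the filler grows as the code shrinks.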

The first thing to do after taking control is to fix the warm start vector by putting it back where it belongs. As it happens, that is $A483, a location that Mapping the C64 places in the middle of one of the main BASIC routines rather than at the start of one.

lda     #$83
sta     $0302
lda     #$a4
sta     $0303

Now we have to actually load our target file. This is a multi-step process that requires four system calls. The first two essentially set all the parameters that we would send to an OPEN, LOAD, or SAVE command: the SETNAM routine sets the file name, and the SETLFS routine sets everything else.

lda     #$08                    ; Open file #8
tax                             ; from device #8
ldy     #mode                   ; 0 for BASIC, 1 for machine code
jsr     setlfs
lda     #pathlen                ; Number of characters in filename
ldx     #<path                  ; Low byte of filename address
ldy     #>path                  ; High byte of filename address
jsr     setnam

The MODE and PATHLEN values here are written directly in place by the loader generator, as it happens, so they will vary from file to file. PATH is a 16-byte character array at the end of the program holding the name of the file to open. That array is always in the same place, but its contents, along with the value of PATHLEN, are filled in by the loader generator as well.

With these calls made, we are now ready to actually ask for the file to be loaded, but the LOAD syscall turns out to be a bit slippery. This system call is not only used to load files, but also to compare a block of RAM with the contents of a file (“verifying”). The accumulator selects which behavior we want: 0 to actually load the file, 1 to verify it. The X and Y registers then hold the low and high bytes of the program’s suggested load address. This address is only used for BASIC programs; if SETLFS had a 1 for its mode parameter, it is ignored. Still, the two loaders share a lot of code, so both end up suggesting the start of the BASIC program text as a load location. Once the load is done, the loader calls CLALL to close all open I/O channels. This is noticeably shorter than closing just the file we opened ourselves.

lda     #$00                    ; Load operation
ldx     txttab                  ; Low byte of load address
ldy     txttab+1                ; High byte of load address
jsr     load
jsr     clall
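
How the mode parameter and the suggested address interact can be modeled in a few lines of Python. This is a sketch of the behavior described above rather than of the KERNAL’s actual code; load_model is a made-up name, $0801 is assumed as the usual default value of TXTTAB, and the end address is assumed to be one past the last byte loaded.

```python
import struct

def load_model(prg: bytes, secondary: int, suggested: int) -> tuple:
    """Return (start, end) for a load; end is one past the last byte."""
    (embedded,) = struct.unpack_from("<H", prg, 0)   # address in the file
    payload = prg[2:]
    # Mode 0 honors the caller's suggestion; mode 1 trusts the file itself.
    start = suggested if secondary == 0 else embedded
    return start, start + len(payload)

prg = struct.pack("<H", 0xC000) + bytes(16)   # dummy 16-byte program
basic_style = load_model(prg, secondary=0, suggested=0x0801)
ml_style = load_model(prg, secondary=1, suggested=0x0801)
```

In the machine-language case the suggestion in X and Y is simply dead weight, which is why both loaders can share the same instructions here.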

At this point, the code diverges depending on whether we were loading a BASIC program or a machine language one.

Auto-running an Assembly Language program

The machine code case is much simpler, as it has only three things to do:

Fix the BASIC runtime in case loading a machine code program messed up its internal state. This means doing what NEW does.
Run the actual program we loaded in.
Return to BASIC once we’re done.

Each of these can be accomplished in a single instruction:

jsr     scrtch                          ; $A642
jsr     main
jmp     ready                           ; $A474

The SCRTCH routine at $A642 is just the implementation of NEW within the BASIC ROM; we may call it like any other subroutine. The main label is put in place by the loader generator; it’s the start address of the program we loaded (which need not be its load address). Finally, the jump to READY at $A474 returns control to the BASIC interpreter at the point where the main loop is about to print the prompt and go back to normal operation.

This is a bit different from the assembly language programs I’ve written and published here; those generally return to BASIC with an RTS instruction. However, those programs were written to be controlled by a (tiny) BASIC program and invoked with the SYS statement within BASIC itself. By returning from the main program, they return to the BASIC program in progress just after the SYS statement has completed. Our loader has no such program to return to: it scrambled and then reset BASIC’s own state, and it took control at a point where the runtime was about to wait for user input rather than in the middle of a statement. We need to send the runtime back to where we forced it to leave off.

Auto-running a BASIC program

The BASIC side also only does three things but they are three different things. First, it loads the Ending Load Address from EAL at $AE and copies it into VARTAB and ARYTAB so that BASIC’s variables will work right:

lda     eal
sta     vartab
sta     arytab
lda     eal+1
sta     vartab+1
sta     arytab+1
nop

I’m not sure why there’s a NOP here, but it does no immediate harm. We are desperately short on space in this loading region, but we do fit and shorter code just means we have to save out more filler until we hit the vectors we override.

Step two is to put three characters into the keyboard buffer:

lda     #'R
sta     keyd
lda     #$d5                            ; Graphic U
sta     keyd+1
lda     #$0d                            ; Return key
sta     keyd+2
lda     #$03
sta     ndx

There’s one slightly clever thing here. Commodore BASICs include shorthands for many statements to make them easier to type or to make longer program lines fit within the screen editor’s 80-character limit. Usually these shorthands are the first letter of the command followed by the graphic character associated with the second letter of the command. RUN is one of those commands even though it saves no keystrokes—but with the code as written here it does save us five bytes that we desperately need for everything to fit in our loading buffer!
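
The encoding of the shorthand is simple enough to sketch in Python. This assumes that the shifted (graphic) form of a letter is its character code with the high bit set, which matches the $D5 used for the U in the listing above; the abbreviate helper is hypothetical and only models the usual first-letter-plus-shifted-second-letter pattern.

```python
RETURN_KEY = 0x0D

def abbreviate(keyword: str) -> bytes:
    """First letter, then the shifted (graphic) form of the second letter."""
    first, second = keyword[0].upper(), keyword[1].upper()
    return bytes([ord(first), ord(second) | 0x80])   # assumed shift encoding

keys = abbreviate("RUN") + bytes([RETURN_KEY])

BYTES_PER_KEY = 5    # LDA #imm (2 bytes) + STA absolute (3 bytes)
saved = (len("RUN") + 1 - len(keys)) * BYTES_PER_KEY
```

One keystroke fewer means one fewer LDA/STA pair in the stub, which is where the five bytes come from.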

That leaves the final step: returning to BASIC so that the keypresses may be processed and the program started in earnest.

rts

This… puts a bit of a caveat on my earlier discussion of why we had to JMP READY instead of RTS. Everything I said above about why we ought to do this is true; however, here we are RTSing anyway and things seem to work. What gives?

Running through it in a debugger, what I find is that this RTS absolutely does pop a nonexistent return address off the stack, resulting in jumping to a “random” location. It’s not nondeterministic, I don’t think, because we’ll have gone through a deterministic set of calls to get to our loading location, but the value we pop is not actually a sensible return address. As a result, we end up in the middle of BASIC’s program memory, promptly attempt to execute a zero byte from the initial RAM fill, trap it as a software interrupt, do a non-destructive soft reset since the KERNAL treats this as equivalent to the user hitting RUN/STOP-RESTORE, and drop to a BASIC prompt where our RUN command is duly and faithfully executed.

We can see that in the execution as well; if we have them both load simple “Hello World” programs, the BASIC program clears the screen before it offers its RUN command, which is a consequence of the soft reset. The assembly language program, on the other hand, takes control so quickly that it barely finishes its status report and the “Hello world” appears right next to the “LOADING”.

We seem to have dodged a bullet here, but I suspect that BASIC programs larger than 25KB or so, or which are loaded after running a program that consumed more than 20KB or so of array space, would not be so lucky. Ironically, we could replicate the current behavior more reliably simply by cutting out the middleman and replacing the RTS instruction with a BRK ourselves.

As for why Carmichael introduced this discrepancy in the first place, I suspect that it was due to space constraints. Even if we cut out the NOP above, there isn’t enough room left in our buffer for a JMP instruction before we have to sacrifice space reserved for the name of the file to load.

The Loader Generator

We’ll be saving the framework BASIC code that was actually published in the article until next week. Most of that code just collects the necessary data and edits the code to be saved in-place. The part where it actually does the saving, however, is kind of neat. It turns out the API for SAVE is very, very different from the API for LOAD!

It does still require the SETLFS and SETNAM calls before it, but because SAVEing generally does not want a “secondary address” we actually have to set the Y register to $FF instead of 0 or 1. More interesting, though, is that SAVE needs two pointers to do its work: the start and end address. The end address (well, sort of; it’s actually one past the end address) is stored in the X and Y registers, low byte in X. We now have a 16-bit start address we also have to pass and we only have eight bits of register left. The calling convention here is to put the start address somewhere in the first 256 bytes of memory and then put that address, which is only 8 bits, into the accumulator. The loader generator uses $FB for its storage here, which is a good place to put it. (The $FB–$FE region is two pointers worth of space that BASIC and the KERNAL promise never to use, so if your own machine language programs intend to leave a minimal footprint, this is an excellent place to put your zero-page scratch values.)
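
The register convention is easier to see in a model than in prose. The Python below is a sketch under stated assumptions: save_model and the bytearray standing in for RAM are invented for illustration, and the end address of $0304 is simply the byte after the two vectors rather than a value taken from the published generator.

```python
import struct

def save_model(ram: bytearray, zp_ptr: int, x: int, y: int) -> bytes:
    """Model SAVE's inputs: A = zero-page pointer, X/Y = end address + 1."""
    if not 0 <= zp_ptr <= 0xFE:
        raise ValueError("the start pointer must live in zero page")
    (start,) = struct.unpack_from("<H", ram, zp_ptr)   # 16-bit start address
    end = x | (y << 8)                                 # one past the last byte
    return struct.pack("<H", start) + bytes(ram[start:end])

ram = bytearray(0x10000)
ram[0x02A7:0x0304] = bytes(range(0x5D))     # dummy loader image
struct.pack_into("<H", ram, 0xFB, 0x02A7)   # start address parked at $FB/$FC
prg = save_model(ram, 0xFB, x=0x04, y=0x03)
```

The saved file leads with the start address, which is exactly the two-byte header that the loader stub’s own listing begins with.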

One other fun thing about the machine-code component of the loader generator is that it manages some of the string-data input from the machine language side, and it allows the full power of the C64’s screen editor while it does so. It turns out that you just get that for free when you use the CHRIN system call instead of the GETIN one that you normally use when scanning the keyboard. It’s maybe a bit of a stretch to compare this to the line-buffered input that UNIX-like systems give by default when reading from stdin, but it’s only a bit of one.

To the Rescue?

I have a set of projects on this blog that I refer to as “type-in rescues”—taking old type-in programs and tuning them up to bring them to their full potential or to fix issues within them. I didn’t really intend to do that here, but there are enough little infelicities in the code that I wouldn’t want to use these autoloaders as-is.

The loader generator needed the most work, but if we’re going to build auto-loaders today, we aren’t going to use it anyway; we’re using cross-development tools already so we’re better off either with some template code we can fill in the blanks on as needed or a standalone generator outside of the emulated system. For my “final versions” of these loaders, the key change here was to move the filename buffer from the end of the program to the beginning. This lets the programmer have everything they need to edit right at the front of the file.

As for the code itself, the assembly-language loader is almost perfect, but I’d prefer to fix the cursor position before we start. We can do that with a simple LDA #13; JSR $FFD2 before calling main if we want to keep the screen intact, or we can swap the 13 for a 147 if we want to clear the screen instead. We have plenty of room to work with on the assembly language side, so adding this code doesn’t put any pressure on anything else. Even if we were concerned, we could drop the cost from five bytes to two simply by removing the call to CLALL, because the call to SCRTCH actually calls it as part of its function.

The BASIC side, however, is in much more dire shape, and it’s also under significant space pressure. Happily, while I was tuning it, I discovered that a lot of the pain stemmed from the desire to keep as much of a common prefix between the BASIC and machine-language loaders as possible. We gain significant breathing room if we relax this.

In particular, we no longer need to call CLALL immediately after our call to LOAD. The LOAD syscall, in addition to writing the Ending Load Address to memory at EAL, also returns it in the X and Y registers. By writing out the values to VARTAB and ARYTAB immediately, before calling CLALL, we can save four bytes of program code since we don’t need to consult EAL at all. This, combined with deleting the unnecessary NOP instruction, gives us five bytes of additional breathing room. This is enough to fix the RTS problem: it only costs us two additional bytes to replace it with a JMP READY instruction instead. Better yet, it lets us handle a case that the original didn’t contemplate at all; if the original BASIC program was SAVEd under a non-default memory configuration, this will be reflected in the saved file and this is supposed to be sorted out by the LOAD command on the BASIC side. The KERNAL-level LOAD does not do this, though, so we would expect such programs to misbehave if autoloaded. We can use the remaining three bytes of savings, though, to issue a call to the LINKPRG routine at $A533 which analyzes the loaded BASIC program and updates all internal pointers.
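
The byte accounting in that paragraph can be double-checked with a quick tally. The Python below is a back-of-the-envelope sketch using standard 6502 instruction sizes; the exact instruction order of the revised code is an assumption, and the keyboard stuffing and CLALL call are left out because they cost the same in both versions.

```python
SIZE = {"lda zp": 2, "sta zp": 2, "stx zp": 2, "sty zp": 2,
        "nop": 1, "rts": 1, "jmp abs": 3, "jsr abs": 3}

def total(instructions):
    """Sum the encoded size, in bytes, of a list of instructions."""
    return sum(SIZE[i] for i in instructions)

original = ["lda zp", "sta zp", "sta zp",   # low byte, fetched from EAL
            "lda zp", "sta zp", "sta zp",   # high byte, fetched from EAL+1
            "nop", "rts"]
revised = ["stx zp", "sty zp",              # VARTAB straight from X and Y
           "stx zp", "sty zp",              # ARYTAB straight from X and Y
           "jsr abs",                       # the call to LINKPRG
           "jmp abs"]                       # JMP READY in place of RTS

freed = total(original[:-1]) - total(revised[:4])   # pointer setup plus NOP
net_change = total(revised) - total(original)
```

Under these assumptions the five freed bytes are spent exactly: two on upgrading RTS to JMP and three on the extra JSR, so the revised tail is the same size as the original.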

(How often would one run into this case, one might ask. More often than you’d think; COMPUTE!’s second-generation proofreader altered the memory layout to make room for its proofreader routine.)

One final thing we can do is alter the keyboard-stuffing process so that it uses a loop and a string array of some kind; this changes the incremental cost of characters from five bytes per additional character to one. It turns out that with our current implementation we have just enough space left after that conversion that we don’t have to compress the RUN command, which looks a bit prettier on the main screen.
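
To get a feel for where the loop starts paying for itself, here is a rough cost comparison in Python. The loop it prices out (LDX #n, STX NDX, then LDA table,X / STA buffer,X / DEX / BNE) is one hypothetical way to write the copy, not necessarily the one used in the final loaders, so the fixed overhead of 13 bytes is an assumption.

```python
def unrolled_cost(n: int) -> int:
    """n pairs of LDA #imm / STA abs, then LDA #imm / STA zp for the count."""
    return 5 * n + 4

# LDX #imm, STX zp, LDA abs,X, STA abs,X, DEX, BNE (assumed loop shape)
LOOP_OVERHEAD = 2 + 2 + 3 + 3 + 1 + 2

def looped_cost(n: int) -> int:
    """Fixed-size copy loop plus one byte of string data per keystroke."""
    return LOOP_OVERHEAD + n

costs = {n: (unrolled_cost(n), looped_cost(n)) for n in (3, 4)}
```

With this particular loop, the full four-keystroke sequence (R, U, N, RETURN) comes out smaller than the original three-keystroke unrolled version.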

A final possible optimization would be to “right-justify” the loader program, pushing the start address forward so that there is no wasted space between the end of the actual program and the start of the system vector table. This, ironically, buys us nothing at all. Programs are loaded off of disk a sector at a time, and each sector holds 254 bytes of program data. We’re under 100 even if we use every byte in our allotted region, so the system will always end up loading one sector. There is no meaningful time or space to save, and all of the code is dead code the moment the real program starts and may be overwritten without consequence. Even the DATA statements in the BASIC loader will oblige us to shave off eight full bytes before we save so much as a single line of code.
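
The sector arithmetic is quick to confirm. This short Python sketch uses the 254-byte payload figure quoted above and assumes the file consists of the two-byte load address, the full 89-byte region, and the two overridden vectors.

```python
SECTOR_PAYLOAD = 254          # program bytes per disk sector

def sectors_needed(file_bytes: int) -> int:
    """Whole sectors required to hold a file of the given size."""
    return -(-file_bytes // SECTOR_PAYLOAD)    # ceiling division

region = 0x0300 - 0x02A7      # 89 bytes of loader space
file_bytes = 2 + region + 4   # load address + region + two vectors
```

Trimming filler from the front of the file changes file_bytes but not the sector count, so there is nothing to gain on disk.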

Downloads

I’ve dropped bugfixed and commented versions of the autoloaders as editable templates into the Bumbershoot GitHub repository (here are the assembler and BASIC loaders, respectively). These were already very fine utilities, and with access to modern cross-development tools, we can make them even better and integrate them cleanly into modern workflows too.