Where should portability stop? (OpenBSD)

In the spring of 1994, Sun Microsystems launched two new workstations in the SPARCstation family: the SPARCstation model 20, intended to replace the successful model 10, with the ability to use up to four processors, and the less expensive SPARCstation model 5, intended to replace the aging models 1, 1+ and 2, but limited to a single processor and only a maximum of 256MB of memory (while models 10 and 20 coul…

Although Sun hardware was already well known for reliability and network throughput, it was less appreciated for its graphics capabilities, which were decent, but nothing more. Users with strong graphics needs would prefer HP or Silicon Graphics workstations, for which true colour (24-bit) frame buffers were easily available (and of course, if you wanted to use that promising new "OpenGL" technology, Silicon Graphics was the way to go.)

Most SPARCstations at that time were using an 8-bit colour frame buffer with some 2D acceleration features, known as the LEGO (Low-End Graphics Option) but better known by its SunOS and Solaris driver name: cgsix, as this was the sixth colour frame buffer designed by Sun.

But if you had a need for a true colour frame buffer, you could either buy the special VSIMM for the SPARCstation 10 and 20, or Sun’s ZX double-width board, known to be quite power hungry and prone to overheating, requiring serious attention to the airflow in the workstation, or use a third-party option, such as the popular RasterFlex boards.

For customers not willing to pay the price of a SPARCstation 20 with either a VSIMM or a ZX board, Sun wanted to have a lower-end, but decent, true colour frame buffer. Which is (one of the reasons) why the SPARCstation 5, in addition to its three regular SBus expansion slots, also has a special connector, known as the AFX connector, to which only one board can be connected: the so-called S24 frame buffer option for the SPARCstation 5.

That card has a form factor very similar to a regular SBus card, but it is slightly wider, due to the position of the AFX connector on the SPARCstation 5 motherboard (the 3rd expansion slot can be filled with either an SBus card, or the S24.)

A picture being worth a thousand words, this is probably easier to understand by looking at this depiction of the SPARCstation 5 motherboard in the Sun Field Engineer’s Handbook:

Here is what the S24 board looks like (follow the links for much larger pictures); that board was sent to me by Thomas Serra, as a loan which became a donation, thank you again for this gift!

Component side:

Note that you can see, on the left edge, the first names of the tcx team members (Mike, Roger, Guy, Steve, Vernon, Kevin, Rajiv, Ed, Fiona, Chris, Joe, and I am wondering if "Si", left of "Fiona", isn’t also a first name, or a short form.) Also note that a few names are crossed out; these people probably left the team before the completion of the project. Putting names on hardware is quite uncommon at Sun, but among the few projects which did, the later evolutions of the cgsix frame buffer, especially the high-resolution ones (XGX) allowing a 1280x1024 display instead of the usual 1152x900, also had first names written on an edge in a similar fashion, with some names crossed out identically. There were probably people working on both teams, or enough emulation between the two teams, for them to decide to embed their names into the hardware they designed. Maybe a Sun old-timer can shed some light on this...

Bottom side, with the other half of the video memory:

In addition to the S24 board, a stripped-down (limited to 8-bit colour) version of the frame buffer was put onboard an also stripped-down version of the SPARCstation 5, released under the SPARCstation 4 moniker, about one year later. The SPARCstation 4 was an even more entry level SPARC workstation, with only one SBus expansion slot, no audio capabilities (but one could buy a specific expansion board to get them back), and only 5 memory slots, limiting it to a maximum of 160MB.

The internal drive bay of the SPARCstation 5 and 20, which would allow for up to two SCSI disks, one floppy drive and one CD-ROM drive, was also replaced by a simple mounting bracket for a single internal SCSI disk and nothing else; this allowed in turn the 150W power supply of the SPARCstation 5 to be replaced with a cheaper 50W power supply, reducing the costs further. (This also allowed me to mount a SPARCstation 20 SCSI backplane and its power supply into my own SPARCstation 4, to create a SPARCstation 4½ with two internal disks and better airflow around the processor.)

The SPARCstation 4 turned out to be quite popular in universities and technical schools, due to its reduced price (as well as Sun’s aggressive policy to sell such systems at a loss, so that the future graduates would get used to Sun systems and would recommend them at their future jobs.)

When found onboard the SPARCstation 4, the frame buffer would identify itself as SUNW,tcx, and unsurprisingly the SunOS and Solaris drivers for it were called tcx.

Given the availability of the SPARCstation 4, it did not take long for Dave Miller to write a simple tcx driver for Linux, and for Paul Kranenburg to write a simple tcx driver for NetBSD, both in 1996. At that time, the NetBSD and OpenBSD kernels had not diverged much, and code was synchronised between them on an irregular basis by Jason Downs, who brought that driver into OpenBSD in 1997.

And then, not much happened - you could use a glass console on these systems with the tcx frame buffers, and run X11, but they were handled as dumb frame buffers (i.e. a large memory area, with a programmable colormap.)

That might be worth a story in itself, but long story short, in the summer of 2002, I worked on converting the existing sparc frame buffer drivers to the wscons console framework; and as a side effect of that work, the S24 frame buffer got 24-bit support in the X server (while it was previously limited to 8-bit mode, emulating a bare cgthree.)

While working on that 24-bit tcx support, I realized that this hardware was able to perform 2D acceleration in a way roughly similar to the cgsix, but unfortunately, I couldn’t find any documentation about that device. I expressed my frustration on the OpenBSD developers’ chatroom on July 26th:

<miod> grumble. that tcx fbc area is completely different from the cg6 one...
And of course, the only way to get more tcx info would be to register at
sun to download the solaris source code, only to get their bloody tcxreg.h
that I'm not even sure still exists. And I'm not sure you can still
download solaris source either.
<deraadt> miod, you can search for tcxreg.h on the net
<deraadt> that is how i found the entire codebase for their pcmcia chipset
<deraadt> ask XXX for a nice url where someone has solaris code on the net
<deraadt> kind of funny
<miod> Theo, I already searched for this file, and only found *BSD source code...
<miod> but I'd welcome XXX' url.

(XXX above is another OpenBSD developer who shall not be named in this story, in case Oracle lawyers feel hungry, even though these events took place 23 years ago.)

That topic came back two weeks later, on August 7th:

<jason> ss4's and ss5's take too long to reboot.
<miod> I don't have this feeling, jason
<jason> does ss4 require a vsimm to support video or is that video ram I see on
the motherboard under the disk?
<miod> this is a vsimm that will allow for higher refresh rate and resolution
<miod> but it has on-board video ram
<jason> ok, cool.
<miod> and if you ever find more docs about the tcx (such as sun's hidden tcxreg.h)
that enables me to write faster rasops code for it, you'll be my hero.
<miod> (this card is supposed to be as fast as a cgsix, if not faster)
<jason> Hmm, does X support it?  That's were I got "docs" for the creator.
<miod> X does support it as a cg3 emulation
<jason> Icky... that's not what I would call support =)
<miod> you get my point.
<miod> (plus, the tcx's cg3 emulation is incomplete (-:)
<jason> How can it be incomplete? cg3 doesn't DO anything.
<miod> tcx does not emulate the cg3 emulating an overlay plane.
<miod> (yes this is tricky)
<jason> linux doesn't support raster ops on it either?
<jason> Eep, no, it doesn't.
<miod> neither
<miod> hell, I have looked everywher (or so I think)
<miod> and solaris' man page tells you to look at tcxreg.h that is unfortunately
missing by mistake off their SUNWtcx* packages.
<miod> see how the world hates me?
[...]
<fries> http://www.doc.ic.ac.uk/~mac/manuals/solaris-manual-pages/solaris/usr/man/man7/tcx.7.html
.. you mean this man page miod?
<miod> yes
<miod> tcx(7d)
<fries> here's more torture, miod: http://www.mit.edu/afs/sipb/project/kernel/sunos.414/sys/sundev/tcxreg.h
<miod> permission denied, right?
<miod> 1/3rd of the interesting stuff you find on the mit afs is forbidden.
<jason> But that means if you can find a copy of 4.1.4 you might be able to find
a header file.
<miod> yes. unfortunately I only have 4.1.3_U1 as most recent SunOS media
<fries> I've a script that likes to traverse afs and look for things. putting it
to use looking for tcxreg.h ;-)
[...]
<XXX> miod: you have tcxreg.h in your mailbox
<miod> XXX, you're the best! thanks a million
<jason> It has defns for the raster ops?
<miod> I'm looking... just came back from shopping
<miod> some defines. perhaps not enough. I'll have to experiment a bit.
[...]
<miod> damn, tcxreg.h does not expose family's jewels - it's no use
<miod> [but at least I am not looking for it anymore]
<miod> ok, who has solaris source? (-,

(XXX above is another OpenBSD developer, but not the same one as above, who shall also remain anonymous in this story, in case Oracle lawyers feel hungry, even though these events took place 23 years ago.)

Although tcxreg.h has a lot of interesting content, it was not enough to let me make that frame buffer run fast, and I was quite disappointed. You can see its contents for yourself here on GitHub.

Then on the 20th of December 2008, I stumbled upon that gem of a document, written by der Mouse (that’s a pseudonym, in case you were wondering), which is still available online, and starts with

This document is my own description of the S24 interface, written based
on documentation I am not prepared to distribute verbatim.  The
documentation appears to have been converted badly between formats and
may be missing fragments; I can only guess what may be missing.  The
description below assumes the CPU is a SPARC, which is not unreasonable
since the S24 exists only for the SS5.

The document also ends with a summary of the acceleration capabilities:

As far as I can tell the {,R}{STIP,BLIT} spaces constitute all the
acceleration the S24 has.  For window fills, STIP/RSTIP space is a
factor of two faster than writing to DFB8 and a factor of 8 faster than
writing to DFB24/RDFB32 (assuming of course that bus transaction speed
is the limiting factor), and the BLIT/RBLIT spaces permit copying 32
pixels with one 64-bit write, as compared to a four 64-bit read and
four 64-bit writes to copy 32 pixels through DFB8 space, or 16 reads
and 16 writes through DFB24/RDFB32 space (under the same assumption.)
Not a whole screaming lot of acceleration, but significantly better
than nothing.
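
To make the arithmetic in that summary concrete, here is a small sketch counting the 64-bit bus accesses needed to copy a run of pixels through each space. The pixels-per-access figures are read off the quote above (8 pixels per access in DFB8, 2 in DFB24/RDFB32, 32 per BLIT write); the function and constant names are mine, and alignment effects are ignored.

```c
#include <assert.h>

/* Pixels moved by one 64-bit bus access, as implied by der Mouse's summary. */
enum {
	PIX_PER_ACCESS_DFB8  = 8,	/* 8 bits per pixel */
	PIX_PER_ACCESS_DFB24 = 2,	/* 32 bits per pixel */
	PIX_PER_BLIT_WRITE   = 32	/* one command word copies 32 pixels */
};

/* Round-up division: accesses needed to cover `pixels` pixels. */
static unsigned
accesses(unsigned pixels, unsigned per_access)
{
	assert(per_access != 0);
	return (pixels + per_access - 1) / per_access;
}

/* Copy through a dumb frame buffer space: one read plus one write per chunk. */
unsigned
copy_cost_dumb(unsigned pixels, unsigned per_access)
{
	return 2 * accesses(pixels, per_access);
}

/* Copy through BLIT space: a single write per 32-pixel chunk, no reads. */
unsigned
copy_cost_blit(unsigned pixels)
{
	return accesses(pixels, PIX_PER_BLIT_WRITE);
}
```

For 32 pixels, this gives 8 accesses through DFB8 (four reads and four writes), 32 through DFB24/RDFB32 (16 reads and 16 writes), and a single write through BLIT space, which matches the figures quoted above.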

If you are interested in the technical details, you can read that document; it is not that long. As far as 2D acceleration is concerned, here is a summary of what it reveals.

First, unlike most 2D accelerated frame buffers of that era, there is no dedicated Drawing Engine or Transformation Engine, into which one would program the coordinates of the areas to work on, maybe some colour information, and then a command code telling the Engine what to do.

Instead, acceleration features were split into two basic parts:

A stipple area, where commands would cause series of pixels to be drawn in the given colour; this would be used for fill operations (clearing parts of the screen, drawing backgrounds...).
A blit area, where commands would cause frame buffer to frame buffer copies; this would be used for scrolling operations (copying data from one area to another).

Also, the blit area might not support the common raster operations codes, and could only copy data in the most stripped-down designs, such as the on-board frame buffer on the SPARCstation 4; the S24 would have more room for chip features and would support more operations, especially the invert operation, allowing the console cursor to be drawn by performing a blit of the current character cell onto itself in invert mode.

All of this gives the impression that it was possible (and intended) to build tcx frame buffers a la carte with various levels of acceleration features, in order to save chip complexity, size, and ultimately costs, and I wouldn’t be surprised if this was one of the major requirements of this project at Sun. This also would explain why the S24 provides all the features, while the on-board frame buffer on the SPARCstation 4 provides a much smaller set of features.

An interesting tidbit about the blit and stipple spaces is that one triggers commands by performing 64-bit writes into these spaces, with the 64-bit double word containing command parameters (colours, operation width, plane masks...), while the offset from the beginning of these spaces, easily computed from the address of that 64-bit write, would encode screen coordinates (the top-left corner of the area concerned).

This is quite smart, as this means your 64-bit command word is actually more like an 86-bit command word, once you take the offset into account, and this allowed for the commands to be more versatile - if your particular tcx implementation would support this, of course.
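
As an illustration of that address-as-parameter trick, here is a minimal sketch. It rests on an assumption which this post does not spell out: that every pixel owns one 64-bit slot in the space, so that the byte offset is simply the pixel index times eight. The structure and function names are hypothetical, and the command word is treated as opaque since its layout is not described here.

```c
#include <assert.h>
#include <stdint.h>

/* One pending 64-bit store into a stipple or blit space. */
struct tcx_store {
	uint32_t offset;	/* byte offset from the start of the space */
	uint64_t cmd;		/* command parameters, treated as opaque here */
};

/*
 * ASSUMPTION: every pixel owns one 64-bit slot, so the top-left corner
 * (x, y) of the operation lives at byte offset (y * width + x) * 8.
 */
struct tcx_store
tcx_encode(uint32_t x, uint32_t y, uint32_t width, uint64_t cmd)
{
	struct tcx_store s;

	assert(x < width);
	s.offset = (y * width + x) << 3;
	s.cmd = cmd;
	return s;
}

/* Inverse mapping: what the hardware would recover from the address. */
void
tcx_decode(uint32_t offset, uint32_t width, uint32_t *x, uint32_t *y)
{
	uint32_t pixel = offset >> 3;

	assert(width != 0 && (offset & 7) == 0);
	*x = pixel % width;
	*y = pixel / width;
}
```

Under that assumption, the driver never has to load coordinates into a register: choosing where to store the command word is what tells the hardware where to draw.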

After processing that information (read: understanding it, then thinking of how I could use it), I started to tinker with the tcx driver code. I made progress quite quickly, as shown by a bunch of activity on the tcx.c file on December 23rd (2 commits in the late evening, revisions 1.32 and 1.33) and 24th (7 commits in the afternoon and the evening, from revision 1.34 to revision 1.40.) More minor improvements (from revision 1.41 to revision 1.43) occurred during the next couple of days.

I also asked for volunteers to help optimizing the logic, in case I had missed something...

<miod> actually, if anyone has brain cycles to spend at an interesting problem,
I'm interested in any improvements in tcx(4) accelerated code. I've tried
to move invariants out of loops, and minimize use of multiplication
(although being on sun4m it is not that critical), but I think it is
possible to win more cycles here and there.
<kettenis> I'm more interested in unaccelerated ifb(4) working properly ;)
<miod> i spent hair and brain cells on it already.
<miod> i value my brain cells while i still have some.
<miod> at least for tcx i got incomplete docs, but i could build the big picture
and have a real challenge with dire-chip algorithms.

(ifb was another frame buffer driver for Sun’s XVR-500 boards we were working on at that time, which got significantly improved a few days after that conversation took place.)

...but asking this on Christmas eve is probably not the most appropriate time, even in the OpenBSD developer community, and I did not receive any help.

One of the cleanup commits (1.39) however had this log message:

Get rid of all remaining magic numbers but 32. If you need to know why 32
is magic on a 32-bit platform, maybe you shouldn't do kernel programming.

which caused the following caustic commentary, the day after:

<otto> i like miod's magic number commit message
<otto> we should have a list of official magic numbers, like 0,1,2,31,32
<dlg> #define MAGIC_32 32

However, this story is pretty mundane so far: developer gets hardware, eventually gets hardware documentation, figures out which registers to frob to get a working driver, and moves on to the next item on the to-do list.

B o - r i n g.

There is something which makes the tcx driver special, though.

Frame buffer drivers require the kernel to set up various memory mappings to be able to address the frame buffer memory, as well as various registers. Compared to, say, a storage driver, these needs are much larger, because frame buffer memory amounts to up to a few megabytes of memory.

But tcx is worse, because of the way the stipple and blit acceleration areas of the board are set up. Even though the frame buffer memory itself would be 1 or 2MB on the SPARCstation 4, depending on whether a VSIMM is present, and 4MB on the SPARCstation 5 with S24, the stipple and blit zones would be even larger, at least 8MB each. So an accelerated driver, written in a portable way, would need, in the worst case of the S24, 20MB of kernel virtual memory, backed by physical addresses on the AFX bus.

But the 32-bit sparc kernels are quite constrained in their memory layout. There are no separate kernel and userland address spaces, so, just like on i386, the 4GB virtual memory space is split into a top part for the kernel, and a lower part for userland. But while on i386 the 4GB virtual address space is whole, and a 3GB userland - 1GB kernel split has worked quite well, sparc is not as lucky here.

The first generation of SPARC processors (v7) were using a custom MMU based upon the existing Sun-3 MMU (the Sun-3 series being Motorola 68020-based systems, later 68030-based with the Sun-3x, which were the ancestors of the Sun-4 series, the first SPARC-based systems.)

The Sun-3 MMU did not allow for a complete 32-bit virtual memory space, but only 28 bits (256MB.) This means that, of a 32-bit virtual address, only the low 28 bits were significant, and the 29th to the 32nd bits were ignored.

In practice, when looking at 32-bit addresses in hexadecimal, this means that the only valid addresses, sign-extended, are 0000.0000 to 07FF.FFFF (the low 128MB) and F800.0000 to FFFF.FFFF (the high 128MB.) Any address in-between is unreachable.

Even though, on the first SPARC systems (the Sun-4/260, running at a mere 16.67MHz), the address space was 30 bits wide and thus the hole narrower, with valid addresses ranging from 0000.0000 to 1FFF.FFFF and E000.0000 to FFFF.FFFF, the SunOS kernel was linked at address F800.0000; this intermediate address in the upper part of the address space was probably considered reasonable enough, at that time (1987!) to allow for 128MB of kernel virtual space.
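
The address hole described above can be expressed as a simple check: an address is reachable only if all the bits above the implemented ones are copies of the topmost implemented bit. The following sketch uses a helper name of my own choosing and is only meant to reproduce the ranges given in the text, not any actual kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if `va` is reachable through an MMU that implements only
 * `bits` significant virtual address bits and requires the upper bits to
 * be a sign extension of bit (bits - 1).
 */
bool
va_is_valid(uint32_t va, unsigned bits)
{
	uint32_t upper;

	assert(bits >= 1 && bits <= 32);
	if (bits == 32)
		return true;		/* no hole at all */
	/* Bits (bits - 1) through 31 must be all zeroes or all ones. */
	upper = va >> (bits - 1);
	return upper == 0 || upper == (UINT32_MAX >> (bits - 1));
}
```

With 28 bits, F800.0000 passes the check while F000.0000 does not; with 30 bits, anything from E000.0000 upwards passes.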

SunOS leading by example, the 4.4BSD port to the SPARC hardware (and ancestor to both NetBSD and OpenBSD sparc ports) also chose to link the kernel at F800.0000.

With the kernel image loaded at this virtual address, there is only 128MB for the kernel image and data, and its page tables, until one reaches the end of the address space. In practice, this means that the kernel virtual memory space, for dynamic allocations, is a bit larger than 100MB. Allocating 20MB in it for the sake of a single driver is quite the dent.
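
For the record, the numbers behind that claim can be checked with a few lines of C. The sizes are the ones given in this post (4MB of frame buffer memory on the S24, 8MB each for the stipple and blit spaces); the helper names are only illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define MB (1024u * 1024u)

/* Bytes of virtual address space between `base` and the 4GB limit. */
uint32_t
kva_above(uint32_t base)
{
	assert(base != 0);		/* 4GB would not fit in 32 bits */
	return 0u - base;		/* 2^32 - base, modulo 2^32 */
}

/* Worst-case S24 mapping: frame buffer plus stipple and blit spaces. */
uint32_t
tcx_mapped_bytes(uint32_t fb, uint32_t stipple, uint32_t blit)
{
	return fb + stipple + blit;
}
```

A kernel linked at F800.0000 gets 128MB up to the end of the address space, and the S24 in the worst case would consume 20MB of it, which is roughly a fifth of the 100MB or so left for dynamic allocations.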

Of course, that 30-bit address limit only concerns the first three generations of SPARC hardware, named (by Sun) sun4, sun4c and the rare and easily forgotten sun4e (which is a cross between sun4 and sun4c.) Sun realized soon enough that this limit would be a significant thorn in their side, and worked on a new MMU, known as the SRMMU (Sun Reference MMU), which allowed the processor to use a complete 32-bit virtual address space (and a 36-bit physical address space, allowing high-end multiprocessor systems such as the SPARCserver 1000 and the SPARCcenter 2000, both of the sun4d architecture, to use more than 4GB of physical memory, if you were able to afford it.)

On SRMMU-based systems, there was no apparent hole in the 32-bit virtual address space, and the kernel could be linked at a lower address (for example, F000.0000 or E000.0000) and have more virtual memory space, still allowing at least 3GB of virtual memory for userland processes.

This is one of the reasons why Sun would ship different kernels for every SPARC subfamily (sun4, sun4c, sun4e, sun4m, sun4d), with the sun4d and sun4m kernels linked at F000.0000 to let the kernel use more memory.

But in the BSD crowd, there has always been a strong will to be able to provide a "one size fits all" kernel (known as GENERIC), which would work on all systems. And for the sparc ports, this meant linking the kernel at address F800.0000 so that it could run on any of the SPARC machines.

We also thought that this would mean the sparc port would be forever limited to that meager 128MB of kernel virtual space. In the later days of the OpenBSD/sparc port, I reworked this to keep the kernel linked at F800.0000 while allowing for 256MB of kernel virtual memory space on sun4, sun4c and sun4e, and 1GB on sun4d and sun4m. (I really should have tried to link it at a lower address and see if it would still run on the oldest Sun-4 systems, but that idea did not cross my mind.) However, this work did not happen until March 2015; in December 2008, when I was working on tcx console acceleration, the 128MB limit was still standing. And mapping 20MB of device memory in order to get a faster console didn’t look like a good trade-off to me.

Could there be a better way?

It turns out there was, but at the expense of portability.

The tcx hardware has been used on three Sun systems in total:

the SPARCstation 5, with the S24 board which cannot be used on any other system.
the SPARCstation 4, as the on-board frame buffer.
the JavaStation 1, as the on-board frame buffer. (Not surprising, as its hardware design is heavily based on the SPARCstation 4.)

All these systems have a soldered processor of the sun4m family. Now, an interesting feature of all SPARC processors is that they have some memory access instructions (loads and stores) which use a so-called Address Space Identifier, and which are heavily used in the kernel to perform specific actions (access to special internal registers to perform MMU or cache operations, cache bypass, endianness conversion, etc.)

A few ASIs are common to all SPARC processors, as required by the architecture, but most of them will differ across subfamilies. The sun4m processors have a "bypass" ASI, which instructs the processor to, well, bypass the MMU and not attempt to translate the address. In other words, it allows the kernel to perform a regular load or store operation on a physical address, without the need for that address to be mapped somewhere in virtual memory.

This is exactly what I needed for tcx! Those large stipple and blit spaces, which are only written to, could be used with that ASI over their physical address, and I would not need to waste 16MB of kernel mappings (only the frame buffer memory would need to be mapped.)

Because accesses using specific ASIs turn out to be quite a frequent operation in the bowels of the SPARC kernel memory management code, there were already convenient macros to let them be used by C code, in sys/arch/sparc/sparc/asm.h:

/*
 * ``Routines'' to load and store from/to alternate address space.
 * The location can be a variable, the asi value (address space indicator)
 * must be a constant.
 *
 * N.B.: You can put as many special functions here as you like, since
 * they cost no kernel space or time if they are not used.
 *
 * These were static inline functions, but gcc screws up the constraints
 * on the address space identifiers (the "n"umeric value part) because
 * it inlines too late, so we have to use the funny valued-macro syntax.
 */
[...]
/* load int from alternate address space */
#define lda(loc, asi) ({ \
	int _lda_v; \
	__asm volatile("lda [%1]%2,%0" : "=r" (_lda_v) : \
	    "r" ((int)(loc)), "n" (asi)); \
	_lda_v; \
})
[...]
/* store int to alternate address space */
#define sta(loc, asi, value) ({ \
	__asm volatile("sta %0,[%1]%2" : : \
	    "r" ((int)(value)), "r" ((int)(loc)), "n" (asi)); \
})

For the use of tcx, I only needed 32-bit and 64-bit stores using ASI_BYPASS, thus the sta macro and its 64-bit variant stda (not shown here, using the stda assembler instruction instead of sta.)

This turned out to work very well, and I could get rid of the code mapping the blit and stipple spaces. All that remains of this is a comment in tcx_accel_init:

	/*
	 * On S24, try and map raw blit and raw stipple spaces.
	 * We prefer the raw spaces so that we can eventually switch
	 * between 8 bit and 24 bit modes with blitter operations.
	 *
	 * However, on 8-bit TCX, these spaces are missing (and empty!),
	 * so we should fallback to non-raw spaces in this case.
	 *
	 * Since this frame buffer can only exist on SS4 and SS5,
	 * we can rely upon the fact this code will only run on sun4m,
	 * and use stda() bypassing the MMU to access these spaces,
	 * instead of mapping them (8MB KVA each, after all, even more
	 * on an SS4 with the resolution extender VSIMM).
	 */

So we ended up with a driver, written in C, but using macros which would introduce many inline SPARC assembly constructs. The code looks like it is portable, but it isn’t.

Is that a bad thing?

After all, the driver source code lies in sys/arch/sparc/dev/ since it is a sparc-specific driver, so no other platform should try to use it.

The fact that I intentionally wrote the code this way and committed it to the OpenBSD source tree shows that my answer to the question is "it was worth doing that way".

The fact that no other OpenBSD developer complained about the way I had written this code does not necessarily mean that they approved of that decision. It is more likely that no one cared enough about that frame buffer to look at my code. Yet I still hope some developers did and considered it acceptable, but let’s be realistic: no one could care less about the tcx frame buffer in 2008, and I should grow up. But who can resist such an interesting challenge?

(Note that QEMU, when emulating a SPARC system, allows you - or used to allow you - to pick hardware combinations which make no sense, such as a sun4c class processor and a tcx frame buffer. This does not match any hardware ever created, and I would expect OpenBSD to fail horribly due to the use of sun4m ASI in the driver. But such braindead configurations should be rejected by QEMU in the first place.)
