I’ve wanted to write about this topic since I started this blerg, but various things have kept me from it. Stochastic computing is a subject I’ve been somewhat aware of since I went through the “Advances in Computers” series in the LBNL library while procrastinating over finishing my PhD thesis. It’s an old idea, and the fact that sci-hub exists now means I can look up some of these old papers for a little review. The fact that there is a startup reviving the idea (their origin paper here) gives a little impetus to do so.
Von Neumann had a [vague idea](https://books.google.pt/books?hl=en&lr=&id=A2CYDwAAQBAJ&oi=fnd&pg=PA43&dq=Von+Neumann,+…) along the lines of stochastic computing in the 1950s, but the first actual stochastic computing architectures were built more or less independently by Brian Gaines and Wolfgang (Ted) Poppelbaum in the mid-1960s. I first read about this in “Advances in Computers” in the LBNL library while procrastinating on writing my dissertation, and while I didn’t take any notes, it always remained stuck in my noggin. The once-yearly journal started in 1960 and I at least skimmed the abstracts all the way up to 2002 or whenever it was I was procrastinating (physics libraries are way better than the internet/wikipedia for this sort of thing). My takeaway was that Tukey’s FFT was godawful important, there were a few cool alternatives to normal computer architectures, and climatological calculations have always been more or less the same thing (aka lame), just with bigger voxels in the old days.
Stochastic computing was one of the cool alternatives; it was competitive with supercomputers back in the 1960s, and it was something a small research group could physically build. As a reminder, a $25 million (current year dollars) CDC6600 supercomputer of 1964 ran at about two megaflops with space for 970 kilobytes of RAM. They made about 100 of them total for the world. An NVIDIA Titan RTX has 24GB and runs at something like 130 teraflops. Your phone is probably good for 2 teraflops. Something like stochastic gradient descent was, therefore, not worth thinking about even on a supercomputer in those days. Fitting a perceptron or a hidden Markov model was a considerable technical achievement. People would therefore use various kinds of analog computer for early machine learning research, including stuff involving matrices, which were modeled with a grid of resistors. Analog computers can do matrix math in O(1); it only takes as long as the voltage measurement takes. Analog used to be considered parallel computation for this reason.
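To make the resistor-matrix trick concrete, here is a toy sketch in numpy (my own illustration, not any particular historical machine): put input voltages on the columns of a resistor grid and Kirchhoff's current law hands you the matrix-vector product on the rows, in the time it takes to measure the currents.

```python
# Toy model of an analog resistor-matrix multiplier.
# G[i][j] is the conductance (1/ohms) at the crosspoint of row i and column j;
# with voltages V applied to the columns, the current collected on row i is
# sum_j G[i][j] * V[j], i.e. the matrix-vector product, "computed" by physics.
import numpy as np

G = np.array([[1.0, 0.5, 0.2],
              [0.3, 0.8, 0.1]])   # conductances of the resistor grid
V = np.array([2.0, 1.0, 4.0])     # applied input voltages

I = G @ V                         # row currents
print(I)                          # [3.3, 1.8] amps, read out in one measurement
```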
One can think of stochastic computing as a way of doing a sort of analog computing digitally. Instead of continuous voltage levels, stochastic computing used streams of random (known distribution) bits and counted them. This has many advantages; counting bits is pretty easy to do reliably, whereas measuring precise voltages and reliably converting them back into bits requires much more hardware. It’s also more noise immune, in that a noise event is an extra bit or two, which doesn’t affect the calculation as much as some unknown voltage would affect an analog machine. Thinking about it a different way: if you’re dealing with machine learning, the numbers being piped around represent probabilities. A bit of noise hitting a stochastic stream of bits might add a couple of bits, and is unlikely to make the overall probability greater than 1. If your probability is an analog voltage level, any old noise is very likely to botch your probability and turn it into some ad-hoc Dempster-Shafer thing (which hadn’t been invented yet) where probabilities sum to greater than 1. While I’m talking about probabilities here, let it be known that the standard mathematical operations are defined on these computers: addition, multiplication (including for matrices), and calculus, just like with analog computers.
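To make the bit-counting concrete, here is a minimal sketch in numpy (my own illustration; the unipolar encoding, the AND-gate multiplier, and the coin-flip multiplexer adder are the textbook constructions, not anything specific to the machines above):

```python
# Unipolar stochastic arithmetic: a number p in [0,1] is a random bit stream
# whose fraction of ones is p. An AND gate multiplies two independent streams;
# a multiplexer driven by a fair coin computes the scaled sum (a + b)/2.
import numpy as np

rng = np.random.default_rng(0)
N = 100_000     # stream length; error shrinks roughly like 1/sqrt(N)

def encode(p, n=N):
    """Encode a value in [0,1] as a Bernoulli bit stream with mean p."""
    return rng.random(n) < p

def decode(stream):
    """Recover the value by counting ones."""
    return stream.mean()

a, b = encode(0.3), encode(0.8)

product = a & b                      # AND gate: expectation 0.3 * 0.8 = 0.24
select = encode(0.5)                 # fair coin select line
scaled_sum = np.where(select, a, b)  # MUX: expectation (0.3 + 0.8)/2 = 0.55

print(decode(product), decode(scaled_sum))   # ~0.24, ~0.55
```

Note there is no multiplier circuit anywhere in that: the expensive arithmetic turns into single gates, which is where the hardware savings come from.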
Noise immunity is quite a useful property for 1960s-era computers, especially in noisy “embedded” environments. TTL was available, but noisy and not particularly reliable for digital calculations. It was contemplated that such machines would be useful in aircraft avionics, and the idea was actually tried out in a small early autonomous neural net robot, a British version of LORAN, a radar tracker, and a PID controller.
The early stochastic computers were generally parts of some kind of early neural net thing. This was a reasonable thing to do, as back then it was already known that animal neural nets are pulse-rate encoded. The idea lives on in the liquid state machine and other spiking neural nets, a little-loved neural net design which works sort of like actual neurons, but which is quite difficult to simulate on an ordinary computer that has to calculate serially. Piece of cake for stochastic computers, more or less, as they include all the parts you need to build one, rather than simulating pulses flying around. Similarly, probabilistic graphical models are easier to do on this sort of architecture, as you’re modeling probabilities in a straightforward way, rather than the indirect random sampling of data in commonly used PGM packages. Brian Gaines puts the general machine learning problem extremely well: “the processing of a large amount of data with fairly low accuracy in a variable, experience-dependent way.” This is something stochastic machines are well equipped to deal with. It’s also a maxim repeatedly forgotten and remembered in the machine learning community: dropout, ReLU, regularization, stuff like random forests are all rediscoveries of the concept.
This isn’t a review article, so I’m not going into a lot of the gritty details, but these things have a lot of nice properties. The bit streams you’re operating on don’t actually need to be all that random: just uncorrelated, and there is standard circuitry for dealing with this. Lots of the circuits are extremely simple compared to their analogs on a normal sequential machine; memory is just a delay line, for example. And you can fiddle with the accuracy simply by counting more or fewer bits.
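Here is a quick numpy illustration of two of those properties (mine, not from the old papers): you buy precision with stream length, and the streams only need to be uncorrelated rather than truly random. Feed an AND gate a stream and a copy of itself and you get p instead of p squared, which is the failure mode the de-correlation circuitry exists to prevent.

```python
# Accuracy vs. stream length, and why correlation between streams matters.
import numpy as np

rng = np.random.default_rng(1)
p, q = 0.3, 0.8

# Counting more bits buys precision: the standard error of the decoded
# product shrinks like 1/sqrt(N).
for N in (100, 10_000, 1_000_000):
    a = rng.random(N) < p
    b = rng.random(N) < q
    print(N, (a & b).mean())    # converges to 0.24 as N grows

# Correlated inputs break the multiplier: a stream ANDed with itself
# gives back p (~0.3), not p*p (0.09).
a = rng.random(10_000) < p
print((a & a).mean())
```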
Remember, all this 1960s stuff was done with individual transistors and early TTL ICs with a couple of devices per chip. It was considered useful back then, and competitive with, and a lot cheaper than, ordinary computers. Now we can put billions of far more reliable transistors down on a single chip, and probably even more of the unreliable kind, which is what stochastic computers were designed around. Modern chips also have absurdly fast and reliable serial channels, like the ones AMD chiplets use to talk to each other. You can squirt a lot of bits around, very quickly; much more quickly than in the 1960s. Oh yeah, and it’s pretty straightforward to map deep neural nets onto this technology, and it was postulated back in 2017 to be much more energy efficient than what we’re doing now. You can even reduce the voltage and deal with the occasional bit flip, since the whole scheme degrades gracefully under noise.
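For the neural net claim, here is a sketch of how a single neuron's dot product might map onto bit streams (numpy again; the AND-gate synapses and the multiplexer adder are the standard textbook constructions, not anything specific to the 2017 work or to Extropic):

```python
# Stochastic dot product for one neuron: one AND gate per synapse, an n-way
# multiplexer with a uniform random select line for the (scaled) sum, and a
# counter to read the answer back out.
import numpy as np

rng = np.random.default_rng(2)
N = 200_000

def encode(p, n=N):
    return rng.random(n) < p

x = np.array([0.2, 0.9, 0.5, 0.7])   # activations in [0, 1]
w = np.array([0.8, 0.1, 0.6, 0.3])   # weights in [0, 1]

# One AND gate per synapse gives the per-synapse product streams.
products = np.array([encode(xi) & encode(wi) for xi, wi in zip(x, w)])

# The multiplexer picks one product stream at random each tick, which
# computes the scaled sum (1/n) * sum_i w_i * x_i, keeping the result in [0, 1].
select = rng.integers(0, len(x), N)
dot_stream = products[select, np.arange(N)]

print(dot_stream.mean() * len(x), np.dot(w, x))   # ~0.76 vs the exact 0.76
```

The same pieces tile into a matrix multiply, and nothing in there needs a fast clock or a precise voltage reference, which is presumably where the energy-efficiency argument comes from.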
I assume this is what the Extropic boys are up to. I confess I looked at their website and one of the videos and I had no idea WTF they were talking about. Based Beff Jezos showing us some Josephson gate prototype that has nothing to do with anything (other than being a prototype: no idea why it had to be superconducting), babbling about quantum whatevers did nothing for me other than rustling my jimmies. Anyway, if their “thermodynamic computing” is stochastic computing, it is a very good idea. If it is something else, stochastic computing is still a very good idea and I finally wrote about it. They’re saying impressive things like 10,000x energy efficiency, which of course is aspirational at the moment, but for the types of things neural nets are doing, it seems doable to me, and very much worth doing. In my ideal world, all the capital that has been allocated to horse shit quasi-scams like OpenAI and all the dumb nuclear power startups around it (to say nothing of the quantum computard startups which are all scams) would be dumped into stuff like this. At least we have a decent probability of getting better computards. It’s different enough to be a real breakthrough, and there are no obviously impossible lacunae as there are with quantum computards.
Brian Gaines publications:
https://gaines.library.uvic.ca/pubs/
Good recent book on the topic:
Extropic patents: https://patents.google.com/?assignee=extropic&oq=extropic