Revisiting Karpathy's 'Unreasonable Effectiveness of Recurrent Neural Networks'

RNNs

The biggest difference that stands out up-front between RNNs (including LSTMs) and traditional neural networks is that RNNs, in effect, do not have a fixed-size input vector. (LLMs are a funny case, more about that later.)

It’s obvious that a simple neural network has a fixed number of inputs; this one has three, for example:

A simple neural network

Figure 1

When you’re dealing with sequences of inputs, that’s not ideal. Pieces of text – eg. different prompts for an AI, or different texts to translate – can vary in length.

Now, you could just have as many inputs as the maximum sequence length that you want to be able to handle, and then pad shorter sequences out to the full length. But then you’d be doing the same amount of work, regardless of the length of the input sequence. Not very efficient!

As Karpathy explains, the solution in RNNs is that you input your sequence one token at a time into a network. The network has a hidden state, and each incoming token is processed by applying it to the hidden state to create a new state 1, and then using the hidden state to produce the output. That updated hidden state is used by the next invocation of the network, and so on. (You might wonder how on earth you can train these; that’s an interesting point and I’ll write about it later in this post.)

So, that hidden state is what is keeping a memory of what has been seen so far, and can guide future outputs (in combination with the new inputs).

Why don’t LLMs use that trick or something similar and get an infinite context length? After all, you can keep feeding in new inputs forever, and the hidden state could represent everything it’s received so far.

I think that a throwaway comment in the post is a good hint:

In fact, it is known that RNNs are Turing-Complete in the sense that they can to [sic] simulate arbitrary programs (with proper weights). But similar to universal approximation theorems for neural nets you shouldn’t read too much into this. In fact, forget I said anything.

There’s a clash between theory and practice, and I think the caveat there is hinting at the fixed-length bottleneck. Turing completeness means that (in this case) an RNN can run any program that a Turing machine can run. Now, mathematically speaking that is true, but Turing machines have what amounts to infinite memory.

If the floating-point numbers in a hidden state had infinite precision, then you could store infinite amounts of data in them (indeed, you could in theory store an infinite amount of data in a hidden state with one number!). But in practice, floats have a specific precision, 32 or 64 bits or whatever, so there’s a limit to how much you can jam into the hidden state.

This is exactly the fixed-length bottleneck problem that attention mechanisms were designed to solve (more about that in this post). Combined with issues with training RNNs (again, more on that later), it was what led people away from them.

The interesting thing about the LLMs that we use these days is that they approach things from a different angle. We actually do feed in all of the inputs at once, but the architecture is designed so that they accept any input length (up to the context length) by feeding the input in as a tensor – that is, the whole sequence is a single input. And we solve the issue of the fixed-size bottleneck on hidden state by having our equivalent of hidden state be the “context vectors” (the terminology in Raschka’s “LLM from scratch” book) that are passed from layer to layer, which taken in aggregate are a hidden state that scales directly with the number of tokens in the input sequence.

Now, of course, there is that context length limitation. But that is imposed by architectural choices – for example, for GPT-2, things like how many positional embeddings it has – and, of course, is also in practice limited by how long sequences in the training data were, as an LLM that was trained only on 128-token sequences is going to get confused with longer ones. But that, at least in theory, is something we can throw money at – larger models and larger context windows are in theory “just” a matter of how much money we spend on compute (and people are spending a frankly terrifying amount).

The real problem with LLMs, when compared to RNNs, is computational complexity. Although there are lots of tricks one can use at inference time to get it down, at least at training time the complexity in both space and time for an LLM, for a sequence of length n, is O(n2). For an RNN, by comparison, at inference time the space used is fixed, O(1), (which to be fair is also the problem!), at training time it’s O(k) where k is how far we’re unrolling it (again, more about that later), and the time complexity is O(n) on the sequence length.

But ultimately, it’s a trade-off. Both RNNs and LLMs solve the same problem – handling variable-length sequences, and doing more “thinking” for longer ones – but in different ways. RNNs do it by running the same network again and again, keeping track of what they’ve seen so far in the hidden state. On the other hand, LLMs do everything in parallel in a single pass, using tensors of inputs and working state (the context vectors passed between layers) that vary based on the sequence length, and therefore need more calculations to process for longer sequences. And they both have limits on the effective length of the sequences they can handle, RNNs due to the fixed-length bottleneck, and LLMs more explicitly due to architectural choices and training.

So that covers the basics of RNNs and how they differ from LLMs (apart from training, which I’ll keep to the end of this post). It’s worth noting that what I wrote above was about what he calls “vanilla” RNNs, and all of the code he uses later is based on more advanced Long Short-Term Memory networks, but as far as I can tell, the above still applies to those (and if it doesn’t, hopefully we’ll discover why over the next posts in this series).

Let’s take a look at the other differences.

The activation function

This is a small difference, but an interesting one – the code sample in the post uses np.tanh as an activation function. In the LLM I’ve been building based on Sebastian Raschka’s book, we use GELU – and ReLU pops up quite a lot too. From what I gather, using tanh was just standard practice for RNNs at the time – LSTMs use it in combination with sigmoid. Though I was interested to find while researching this post that the paper introducing GELU was only published in 2016, which is a pretty solid reason not to use it in 2015 ;-)

Now something larger-scale.

Bytes, not tokens.

In current-day LLMs, we split up our input text into tokens. The specific tokenisation strategy we choose is generally based on the training data – which sequences tend to occur frequently?

So, using the GPT-2 tokeniser, “The fat cat” breaks down into these three tokens:

‘The’, ’ fat’, ’ cat’

The Portuguese equivalent “O gato gordo” (presumably less-represented in the training data) breaks down to more tokens:

‘O’, ’ g’, ‘ato’, ’ g’, ‘ord’, ‘o’

Longer and rarer words, regardless of the language, often wind up being split into different tokens. For example, “Pseudoscientists prevaricate habitually” is the following GPT-2 tokens:

‘P’, ‘se’, ‘udos’, ‘cient’, ‘ists’, ’ pre’, ‘var’, ‘icate’, ’ habit’, ‘ually’

That means that our inputs to the LLM are token IDs, which are the “units” that the LLM uses to think about them.

The RNNs in Karpathy’s post don’t bother with any of that. The input is just something representing one byte (he says “character”, but the code actually works fine for arbitrary bytes) in the input sequence – in other words, if you were feeding in “The fat cat”, you’d feed in “T”, then “h”, then “e”, then “ “, and so on.

The only slight oddity I can see (and this is from the code rather than the post) is that he seems to build a set of all of the different bytes in the training data (let’s say there are n of them), and then assign each one an ID 1..n 2, and build the network with n inputs. Then he feeds in the “byte ID” as a one-hot vector (he uses the equivalent term “1-of-k encoding”).

I guess with text, this saves you quite a few inputs – for example, for plain ASCII you would have some number <=128 of possible bytes that you could use, so you’d have that number of inputs for one-hot. By contrast, if you just used the raw value of the byte for your one-hot, you’d need 256 inputs, and that might wind up being wasteful and harder to train.

Still, the byte-level (or as Karpathy puts it, character-level) nature of these RNNs is a big difference, and it makes it all the more amazing that these examples work. It feels like LLMs are starting with a huge advantage, because from the get-go, even without training, they have some kind of built-in understanding of words – or at least, tokens, which are not too far off – while the RNNs need to learn about the very concept of words ab initio.

What’s interesting is that they seem to learn about it quite quickly – in the “The evolution of samples while training” section, Karpathy shows that the fact that sequences tend to consist of space-separated groups of letters seems to be learned after not that many iterations. Nifty :-)

The other end of the network is more familiar-looking, however. We have as many outputs as “byte vocab” that was built up above, and we treat them as logits – that is, we just run them through softmax and use that as a probability distribution over which byte (or rather, byte ID) is most likely to appear next. 3

One other thing to note before we move on from this is from the “further reading” section – he says:

Currently it seems that word-level models work better than character-level models, but this is surely a temporary thing

Ah well.

(One thing that does occur to me is that it might be interesting to give an RNN a “front-end” similar to an LLM – that is, run the input text through the GPT-2 tokeniser or something similar, then zap the result through an embedding layer, and then do the normal RNN stuff, and project out to vocab size at the end – maybe even just with a regular RNN layer rather than an FFN! Maybe something to play with once this mini-series is done.)

OK, let’s move on to the tricky bit. How do we train RNNs?

Training RNNs

Karpathy doesn’t cover training in the post, but there are throwaway lines that do more than nod to it – for example:

Technical: Lets train a 2-layer LSTM with 512 hidden nodes (approx. 3.5 million parameters), and with dropout of 0.5 after each layer. We’ll train with batches of 100 examples and truncated backpropagation through time of length 100 characters. With these settings one batch on a TITAN Z GPU takes about 0.46 seconds (this can be cut in half with 50 character BPTT at negligible cost in performance)

OK, first things first – dropout of 0.5, yikes! 0.1 is typical for a modern LLM. However, various AIs reassure me that I’m not misreading – we really are dropping out half of our neuron outputs while training. Apparently that was normal for RNNs – they just trained better that way. Interesting!

But the more interesting thing is how we actually do the training. The whole concept of a hidden state doesn’t play well with the model of how neural network training works as it’s normally taught.

Let’s say we fed in “The fat cat” byte-by-byte; we’d run the network firstly on “T”, store the result, then run it on “h”, store the result, then on “e”, store the result, and so on. Our hidden state would be updated inside the network and stored each time, so – for example – on that third call, with “e”, it would have information that somehow represented that it had already seen “T” and “h”.

Just as with the LLMs, we have a target sequence that we want to be producing – and just like with LLMs, it’s the shifted-left sequence plus an extra target – we’re working per-byte, so that would be “he fat cat “ (note the space on the end). We want the first invocation of the RNN, with “T” as its input, to produce “h”, then the second to map the input “h” to “e”, and so on.

And again, just like with LLMs, we use cross entropy loss to evaluate our results. We compare whatever sequence we got from the 11 calls to the RNN that processed “The fat cat” to the expected output, “he fat cat “, and get a loss. We then use that to generate gradients and just use those to adjust the parameters. Some stripped down PyTorch code that does that:

train_loss = calculate_loss(y_logits, target_y_ids)
train_loss.backward()

optimizer.step()

Simple, right? But “just” is doing a lot of work in that sentence. PyTorch is doing quite a lot of magic for us, and it’s hard to map from whatever it’s doing in its computation graph to the much easier-to-visualise process of doing backpropagation on a normal neural net, where our loss goes firstly to the last layer to work out gradients there, then to the second-last layer, then the third-last, and so on. What about that hidden state? How does that get mixed in?

Conceptually, we can see the training of the RNN as being “unrolling it” – or, to use Karpathy’s phrase, backpropagation through time. That is, you can imagine it as repeating the neural network as many times as we had input items in our sequence, feeding the inputs through each one, and backpropagating through that, with the hidden states going through too. So a five-layer neural network trained on a sequence of length ten would turn into a 50-layer network – a normal neural network, and we can backpropagate through that!

That, I think, needs a bit more unpacking. Let’s give a couple of examples of what an unrolled network might look like.

One-layer networks

Let’s look at single-layer networks first, because they’re simpler. Firstly, we’ll start by changing our view of the RNN. Instead of seeing the hidden state as a variable held inside the network – which is what it is in practice – we can imagine the RNN as being a network that takes an input and a hidden state, and produces an output and a new hidden state. The passing in of the new hidden state to the network in subsequent iterations would be an aspect of the code that was using it, something like this pseudocode:

hidden_state = zeros()
for ii in inputs:
output, hidden_state = rnn(ii, hidden_state)

(PyTorch actually does something not dissimilar, so we’re not going too far off at this stage!)

Let’s sketch that out:

An RNN viewed as a simple NN with the input and hidden state going in, and the output and new hidden state coming out

Figure 2

Of course, an RNN isn’t really a normal neural network – what it’s doing with its inputs and hidden states isn’t quite what happens in one of those – but it’s pretty similar. 4

Now, imagine we’ve run it on a sequence of three inputs, and we want to backpropagate. We “unroll it in time”, which in our model we can treat as being taking three copies, one for each element in the sequence, and running each input through one of them, passing the hidden state through, like this:

An RNN unrolled for a sequence of three elements, with one copy for each

Figure 3

That looks like something we can backpropagate through! We have our inputs (and an initial hidden state going in, and our outputs coming out. We can ignore the final hidden state for the purposes of loss, and we’re done. Although things are skipping layers (eg. input 2 goes straight into the second layer of our unrolled network), that’s no weirder than the residual/shortcut connections in an LLM.

If you’re happy with that as a model for what’s happening with an unrolled RNN, then I recommend you skip to the next section on multi-layer RNNs, as the version above is closer to the reality than the next bit.

However, when I was looking at RNNs for the first time, I was uncomfortable with those connections that skip layers. If you’ve only done regular NNs where everything goes through every layer, it looks weird and might be hard to get your head around, like it was for me. The model I came up with was to posit a set of neurons that have weights that are fixed at one – we’d have a set that all ignore input 1 and 3, and just pass through input 2 unaffected. Likewise we’d have two of those for input 3, and some for the outputs too – like this:

The same unrolled RNN using ‘passthrough’ blocks to avoid shortcut connections

Figure 4

Now that looks very much like a traditional network that we could run traditional backpropagation on.

The passthrough connections are a little artificial – they are pinned to weights of one, and the network isn’t fully connected (we don’t want input 1 leaking into the passthroughs for inputs 2 or 3). But I found it a useful stepping stone when I was moving from traditional networks where everything has to flow through every layer to something with connections that skip a layer, like in figure 3. If you’re in the same position, I hope it helps, but I do urge you to try move on to the version with shortcuts, as it’s closer to the mathematical reality of what’s going on. Think of this last model as being training wheels ;-)

Multi-layer RNNs

That’s reasonably clear for a single-layer network. Multi-layer networks are a little more complex; each layer in the RNN has its own hidden state. If you look at the sample code in Karpathy’s post, he models a 2-layer recurrent network like this:

y1 = rnn1.step(x)
y = rnn2.step(y1)

That’s not dissimilar to how a normal neural network works, but rnn1 is storing its own hidden state, and rnn2 its own different one. For the example of the three-element sequence above, we want our “unrolling in time” to map to a regular neural network that we can backpropagate through.

That’s not the end of the world, though – we can just adjust our “hidden states are outputs” model of an RNN to handle it; in pseudocode:

hidden_states = (zeros(), zeros())
for ii in inputs:
output, hidden_states = rnn(ii, hidden_states)

– that is, I’m treating the hidden state as a tuple of size two, one element for each layer’s hidden state. We can diagram that like this, just showing the inputs and the outputs and ignoring what’s going on inside:

An unexpanded two-layer RNN showing hidden states for each layer going in and out

Figure 5

But inside, we can do the same kind of thing as we did with the unrolling itself, with the appropriate hidden state going to each layer, while the input goes to layer 1, and layer 1’s output goes on to layer 2. The output comes out of layer two, and the updated hidden states are just passed directly out:

An expanded two-layer RNN showing internal routing of data and hidden states

Figure 6

Again, if you’re uncomfortable with connections that skip layers, we can model those as do-nothing passthrough neurons, like this:

An expanded two-layer RNN showing internal routing of data and hidden states, using passthrough neurons

Figure 7

Now that we have a model of how one instance of a two-layer RNN can be modeled as a more traditional NN, hopefully you can see how that might be used in an expanded version of the unrolled model above in figure 3 (or 4 if you prefer passthroughs to shortcuts). It’s clearly getting unwieldy, but hopefully it’s clear that what we have after unrolling is a normal neural network that you can backpropagate errors through.

Once again: this really is just a mental model that’s helpful to see how you can “unroll in time” an RNN – what actually happens, especially with an automatic differentiation system like we have in PyTorch, is going to be pretty different.

But it’s close enough to the reality that we can use it to intuit how things work and understand problems.

Wot no gradients? (Or possibly “ouch”)

One obvious issue is the depth of the network. The deeper a network is – that is, the more layers – the more prone it is to vanishing gradients, where during backpropagation, the further away from the end you get, the smaller the gradients get until they completely disappear. There’s also the opposite problem, exploding gradients, where as a result of how different layers interact, the gradients shoot up to infinity.

Now, there are tricks to avoid them – I wrote about shortcut connections in my main LLM series – but they do involve serious changes in the way that we think about the network’s architecture.

And with RNNs, the depth of the unrolled network is directly linked to the sequence length. After all, a five-layer network fed (say) a 1,024-item sequence is going to unroll into a 5,120-layer network. Good luck backpropagating over that. Well before you get to the copy of the network that represents the first input sequence, you’ll either have no gradients or infinite ones.

One partial solution to that is the “truncated backpropagation through time of length 100 characters.” that Karpathy mentioned in the quote above.

When you’re training an LLM – let’s say on “War and Peace”, you would firstly slice it up into sequences, each as long as your LLM’s context length. You’d then treat them all as being pretty much independent – you’d put a random selection into your validation (and maybe test) sets, then train on the remainder in no particular order, working out the loss between the results of feeding each sequence through the LLM, and the shifted-left target sequences.

Training an RNN is very different. If you had a batch size of 1, and you were training on “War and Peace”, you might in theory run the entire book through, byte by byte, then work out your loss on what it predicted as it went along.

With 3,359,652 characters in the copy I downloaded from Project Gutenberg, and two layers in the example network Karpathy uses, that’s somewhere north of 6.5 million layers in the unrolled network to backpropagate through.

Obviously that’s not going to work :-)

So (and remember that we’re using a batch size of one 5), what you do instead is run through 100 characters, then work out the loss for what you got out, and then do your backpropagation.

That might sound a bit like a normal batch with the length set to the context length in an LLM, but there’s a crucial difference. Once you’ve done your backpropagation, you continue through the training text from exactly the point where you stopped, without clearing out the hidden state. Instead, you detach the hidden state before you continue so that its gradient history is cleared. You do the next 100 characters, and backpropagate and detach again, and so on.

As a result, the second 100-character chunk starts with a non-zero hidden state, which it would not do if we’d just split the whole 3,359,652 sequence into 33,597 completely separate training examples. But the detaching of the hidden state means that each of the two backpropagations only has to deal with 100 times the number of layers in the RNN, which (a) makes things quicker and (b) doesn’t actually matter that much, because vanishing gradients would normally make the backpropagation signal disappear pretty quickly anyway. Indeed, as he mentions, “this can be cut in half with 50 character BPTT at negligible cost in performance”.

And that works in this case. But that’s not a general rule, it’s what he found for these examples. In general, with RNNs you would need to backpropagate over a very deep unrolled neural network – and although in real life you’re not actually unrolling it, what PyTorch is doing is close enough that the vanishing and exploding gradients do occur.

And that, along with the fixed-length bottleneck, is why LLMs, with their fixed depth – essentially constant times the number of attention layers plus some overhead – turned out to be easier to work with.

Wrapping up

So, that’s a wrap on this post about “The Unreasonable Effectiveness of Recurrent Neural Networks” – at least, some thoughts on how the models it describes differ from the LLMs I’ve been learning to date. I hope it was interesting to read, and if you haven’t tried it already, I do suggest having a play with the repo – or, if you’re feeling brave and don’t mind working with ten-year-old Lua ML code, Karpathy’s original repo. I would also say, please do read the post, but I’m sure you already have (and if not, I’m surprised you made it all the way here...)

Let me finish, like Karpathy does, with a sample for an RNN trained on the (tiny 30kB) content of this post – post-itself in the repo, deliberately overfit with validation loss rising:

The RNNs in the sequence of the ware the for the extsting of the network and the frome to the the fard

Well, exactly!

As always, if you’re reading this and know more about it than I do, any comments or corrections would be much appreciated. Oh, and also: if you thought the diagrams were ugly, I agree 100% and would be grateful for any suggestions on tools for automatically generating stuff like that from text – kind of a LaTeX for diagrams, if it exists.

Next time in this series, I’ll post about what I learned while creating my PyTorch implementation. There will be much ado about batching and datasets.

As you can see from his code, he firstly multiplies the old hidden state by a weight matrix to get an “initial draft” of the new state, then uses another weight matrix to project the vector of inputs to the same dimensionality as the hidden state, then adds the two and runs that through his activation function to get the real new hidden state. Then the output is just a (learned) projection of the new hidden state into the output’s dimensionality. ↩ 1.

Lua arrays are one-indexed, so we start with one and go up to n – in Python, of course, it would be more natural to index them 0..(n−1). ↩ 1.

Karpathy mentions the use of temperature in sampling from the probability distribution; that’s not something I’ve covered in my LLM series yet, but it will either come up in the next post in that series, or the next in this one – whichever comes first. ↩ 1.

It can actually collapse to one with three layers. Karpathy’s code has:

self.h = np.tanh(np.dot(self.W_hh, self.h) + np.dot(self.W_xh, x))
y = np.dot(self.W_hy, self.h)
return y

So, imagine a normal neural network with two layers, receiving ireal+shidden inputs. The first layer has a set of shidden neurons that have zero weights for the “normal” inputs, no activation function, and just do the np.dot(self.W_hh, self.h). It also has a second set of shidden neurons, also no activation function, which ignore the hidden state inputs and do the np.dot(self.W_xh, x). Then we have a second layer that adds the two together, with a tanh activation function. Finally we have another layer with no activation function to do the np.dot(self.W_hy, self.h). ↩ 1.

Adding on batches complicates this, of course, and I’ll dig into that in the next post in this series. ↩