Recently, I’ve described what I believe to be a reasonable - even if not necessarily likely - future for real-time rendering.
A few months after the talk, I still believe in it. I believe we barely scratched the surface. We could have much more deeply integrated AI in real-time rendering, both in the runtime and in the content production tooling. At the same time, I don’t think that end-to-end real-time video models will ever supplant traditional rendering, where the latter is already an efficient solution. Real-time solutions live in the balance between quantity, quality, time, and watts - and there is no reason to believe that, in general, AI will strike a better tradeoff between these quantities for problems that we can efficiently solve from first principles.
But ok, we’re not here to re-tread the above-mentioned talk. Instead, let’s try to see beyond that.
If we can identify two points in (a design) space, we can draw a line between them. Interpolate, extrapolate. That is often an amazing way of generating ideas. Predicting, or creating, the future.
The extremes.
Let’s put point A at traditional game/real-time rendering engines.
Built for efficiency over everything - even today, especially at scale, very little is conceded to anything else. Real-time rendering turns pain into pixels.
Engineering pain, with a competition to save milliseconds that, at least up to the last console generation, pulled no punches. Spending a month to save a fraction of a millisecond was not crazy.
Art pain - content creation is tied to the specifics of a given engine, triangles and pixels handcrafted to conform to specific algorithms, each game using different strategies.
Hardly any game is more "handmade" than The Last of Us Part II. A monumental amount of work went into it. It runs even on the now-old PS4 - around 5 joules of energy per frame there.
Note - importantly - that this extends beyond what happens locally - on players’ machines. The whole end-to-end pipeline is designed to maximize quality/quantity per watt, from servers to asset delivery to rendering.
And even if it is optimized, it’s still not enough: games today are still limited by performance and scalability concerns - we have not reached the point where we can concede much.
Now, let’s place point B as far as possible from here: generative video. Real-time "world models" - at least in the weak sense: AI that has learned how to display pixels that can be steered through "actions".
In some sense, this is the complete opposite: nothing is solved from first principles, nothing is hardcoded - all is learned. Inference takes gargantuan amounts of compute.
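To make the "weak sense" concrete, here is a minimal, purely illustrative sketch of the interface such a model implies - the names and shapes are mine, not any particular product's API: an action-conditioned next-frame predictor, where the network is both the simulator and the renderer.

```python
import numpy as np

class ActionConditionedVideoModel:
    """Hypothetical interface of a 'weak' world model: nothing is simulated
    from first principles, the network just predicts the next frame."""

    def step(self, frames: np.ndarray, action: np.ndarray) -> np.ndarray:
        # frames: recent history, shape (T, H, W, 3); action: the player's input.
        # Returns the hallucinated next frame, shape (H, W, 3).
        raise NotImplementedError  # stands in for a very large neural network

# The whole runtime loop collapses to:
#   frame = model.step(history, controller_state)
# every ~16-33 ms, at a gargantuan inference cost.
```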
What do we get in exchange for all this inefficiency (well, other than some questionable bragging rights - i.e., it’s cool)? (in)Arguably, content creation efficiency.
An AI fake in the style of TLOU, generated by Nano Banana Pro from a simple prompt of a few words. Energy estimates for today’s diffusion models are hard, but a reasonable ballpark is 2000-10000 joules per 1080p image.
You write a few words, and the model creates a world. No textures, no triangles, no geometry, no coding - nothing. A whole world hallucinated from a sentence. Very little control - the machine is in the driver’s seat, and it drives fast. The opposite extreme, antipodes.
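Just to put the two ballpark figures mentioned above side by side (both are rough, order-of-magnitude estimates from this post, not measurements):

```python
ps4_joules_per_frame = 5                  # traditional rendering, TLOU2 on PS4
gen_joules_per_image = (2_000, 10_000)    # diffusion-style 1080p generation, ballpark

# Sanity check: 5 J/frame at 30 fps is ~150 W, roughly a whole console's power budget.
print(ps4_joules_per_frame * 30)          # 150

# The per-frame energy gap is on the order of 400x to 2000x.
print(gen_joules_per_image[0] // ps4_joules_per_frame)   # 400
print(gen_joules_per_image[1] // ps4_joules_per_frame)   # 2000
```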
This, to me, starts to sound interesting. We have our two points - can we interpolate between them?
Walk the line.
Arguably, some small steps are already happening.
Rendering engines are increasingly preoccupied with being able to cope with arbitrary content and fast content iteration, even at the expense of efficiency. This allows AI creation to slot in more easily; very few studios would not welcome cheaper content creation with open arms, and not for "evil" reasons.
Games have to be cheaper to produce. It has been true for decades now, we always knew... but now costs are slamming studios against a wall, with an industry that is not in "growth mode" anymore, and more and more hours taken by fewer and fewer IPs.
Similar movement is happening at the other end. We want models that are more and more controllable, more and more steerable, interpretable: open to human craft - even at the expense of more work spent to control and steer.
And we want, well, need - efficiency. It’s a mirror, a dual of the traditional gaming world - here, instead of content creation being too expensive, the problem is that running the world simulation is. And similarly, it’s not news. AI companies know, always knew. It’s just that they were (are) still (pre)occupied with scaling by brute force, expanding model capabilities by spending more: the race for capabilities can’t stop, and efficiency and revenues matter only after the dust has settled and the market has been captured.
And now... kiss.
Right now, we’re still navigating very near the shore. What would it look like to have real-time rendering somewhere in between the two points? What about a quarter off one way? The other way? It’s... exciting to think.
Of course, I have no idea, but let’s just use our imagination.
1/4 A + 3/4 B.
Still fully AI, but taking a few core concepts from traditional engines. I’d prioritize two. First, the idea of an interpretable, directly modifiable state, linked to discrete parts of the world: a house is a house, we know of it, we know its state, and we can change it explicitly. Second, the separation of world from view: a model for world simulation, and a model to "lower" the world into a player/view-dependent final image. This both ensures consistency of the world and opens opportunities to distribute computation differently, with the final "rendering" happening on clients, at lower latency, lower bandwidth, and so forth.
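As a purely hypothetical sketch (all names made up, nothing here is an existing system), that split could look something like this: an explicit, human-editable world state of discrete entities, a learned model that advances it, and a separate learned model that lowers it into a view-dependent image, possibly on the client.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    kind: str                                        # "house", "tree", ... interpretable
    attributes: dict = field(default_factory=dict)   # e.g. {"door": "open"}

@dataclass
class WorldState:
    entities: dict = field(default_factory=dict)     # id -> Entity

    def edit(self, entity_id: str, **attrs) -> None:
        # Direct, explicit modification: a house is a house, and we can change it.
        self.entities[entity_id].attributes.update(attrs)

class WorldModel:
    """Learned simulation: advances the shared, canonical state."""
    def step(self, state: WorldState, actions: dict) -> WorldState:
        raise NotImplementedError

class ViewModel:
    """Learned 'lowering' of the state into one player's final image.
    Could run on the client, for lower latency and bandwidth."""
    def render(self, state: WorldState, camera) -> object:
        raise NotImplementedError
```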
3/4 A + 1/4 B.
This could be the space of engines that treat the world as conventional, "hardcoded", discrete objects. Such an engine could do large parts of the simulation "by hand", using AI only for a final... polish, let’s say.
What if we had a world of objects, perhaps even triangles and images - directly paintable - that represent not quantities for hardcoded rendering and simulation, but attributes? High-level material semantics, rough shapes, for AI to turn into final images - perhaps even to help evolve (simulate) over time?
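A small illustrative sketch of that idea (again, hypothetical names, not an existing pipeline): assets stay as directly paintable meshes and maps, but what they carry are semantics, and a learned pass does the final lowering into pixels.

```python
from dataclasses import dataclass, field

@dataclass
class SemanticObject:
    rough_mesh: object                                  # hand-modeled proxy geometry, still triangles
    semantic_maps: dict = field(default_factory=dict)   # painted maps meaning "wet cobblestone",
                                                        # "rusty iron" - not albedo/roughness inputs
    notes: str = ""                                     # free-form art direction for the learned pass

def rasterize_semantics(scene, camera):
    """Conventional, hardcoded: project rough meshes and semantic maps into
    a per-pixel buffer of attributes (a 'semantic G-buffer')."""
    raise NotImplementedError

def ai_polish(semantic_gbuffer):
    """Learned final pass: turn per-pixel semantics into the final image,
    perhaps even help evolve (simulate) the scene over time."""
    raise NotImplementedError

def render_frame(scene, camera):
    return ai_polish(rasterize_semantics(scene, camera))
```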
1/2 A + 1/2 B.
Probably the hardest to imagine, but we’re just having fun, I hope.
It could be a world where both languages are spoken, and we can translate between the two. I can create a world from prompts; the world, its state, and its simulation are opaque - but at the same time, the world simulation can speak the language of traditional creation and control. It can produce 3D objects for direct investigation and manipulation. It can attach to these objects a list of human/API-interpretable attributes. Perhaps even code, controlling their behaviour.
AI either emits intermediate code and data objects or keeps a live, simultaneous "translation".
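One way to picture it (hypothetical, of course): the generative side keeps its opaque latent state, but must maintain a live translation into objects that humans and tools can inspect and edit, and must accept edits pushed back the other way.

```python
from dataclasses import dataclass, field

@dataclass
class TranslatedObject:
    name: str                                        # e.g. "harbor_house_03"
    mesh: object = None                              # extractable 3D proxy, for inspection
    attributes: dict = field(default_factory=dict)   # human/API-interpretable state
    behaviour: str = ""                              # optionally, emitted code controlling it

class HybridWorld:
    def __init__(self, prompt: str):
        self.prompt = prompt
        self.latent = None                           # the opaque, learned world state

    def translate(self) -> list:
        """Lower the opaque state into editable objects ('speaking language A')."""
        raise NotImplementedError

    def apply_edits(self, objects: list) -> None:
        """Push human edits back into the latent state ('speaking language B')."""
        raise NotImplementedError
```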
Conclusions.
I don’t know. Obviously, this has been only for fun and speculation.
One thing I’m pretty sure of. If we are going to see more evolution along this path of real-time AI, it won’t be something that replaces what game engines do today - they exist already on a pretty settled extreme of some Pareto front.
It would more likely be something used in an entirely different way, by different people and professions, to create products and content for a different market. That market might enter the same competition for time that all entertainment is now in - if targeted at entertainment at all - but it won’t be gaming as we understand it today.