Posted May 15, 2019 by void room
#multithreading
Tea For God is a VR game. This means that it has to render two images, one for each eye, and it has to do so at 90 frames per second. There are several approaches to rendering this, and more are being developed all the time. I tried a few, and the one that worked fastest for Tea For God was the simplest one: rendering two separate images. Note that for different engines and different projects, different solutions might be the best.
But this is not a post about rendering. It’s a post about multithreading. If you know a thing or two about multithreading, skip the next few paragraphs. One more thing before we start: initially I was considering splitting this into a few posts, but decided to just have one bigger post.
What is multithreading? A long time ago (I don’t remember exactly when, but it was more than 10 years ago, maybe more like 15 or even 20), most computers had a single core. Most games ran single-threaded. This means that everything happened one after another. Every single thing. In general, first the input was read, then the game logic was run along with everything related to it, and then that very frame was rendered and presented to the user.
Then multiple cores came and developers realised that they could run a few things at the same time. Imagine that each core is a person who is given specific tasks. With just a few cores, the easiest approach was to have one core do the rendering, the second do the game logic and, if there was another one, have it stream things or do some background tasks. This shift meant that we had to prepare a game frame, but we couldn’t render it while we were preparing it, so we could only render the previous frame. So when we present a rendered frame to the player, another game frame might already be ready. The player sees the past. But that is exactly what happens in our brain anyway. What we see is not what is really there, but what happened a very short time ago. This happens due to the synchronization of the different signals that come from our body. So it’s not such a big deal when the framerate is high enough.
We shifted to rendering the previous frame, but we were still preparing a game frame on one thread. With 2 or 4 cores available, this was not a big deal. But as many computers now have 8 cores, and soon will have more, some of those cores remain unused by the game. Some developers decided to change that, and there are numerous approaches to dealing with multithreading.
This post covers how I dealt with multithreading in Tea For God.
In the game, I try to use all available cores (I have a system that tells how many cores can be assigned to a given type of job; each core may do a few different things). This required breaking most of the work into the most basic, simple tasks and handling the communication between them. Back to the “cores are people” example. Imagine that two different people want to use the same thing, for example, a drawing board.
While two different people can both look at it or read anything from it, they should avoid situations where they both draw or write on it. Sometimes nothing bad will happen, but sometimes they will get in each other’s way. So, to be safe, they can’t write on it at the same time. The simplest solution is to have them queue. But this means that one of them will idle. In my opinion, this is acceptable as long as they don’t idle for too long. Not idling “too long” here means that we are the sole user of something only for the time we actually deal with it. We should not keep blocking everyone else while we only occasionally use it and spend the rest of the time doing lots of other stuff.
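To make the “queue, but only for a moment” rule concrete, here is a minimal C++17 sketch, not the game’s actual code. `DrawingBoard` and `drawTogether` are hypothetical names. The entry is prepared before the mutex is taken, so the lock is held only for the write itself, and the second writer waits just that long.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// A hypothetical shared "drawing board": any number of threads may write to it.
class DrawingBoard {
public:
    void write(int author, int value) {
        // Prepare the entry *before* taking the lock...
        std::string entry = std::to_string(author) + ":" + std::to_string(value);
        // ...and hold the lock only for the short moment of writing it.
        std::lock_guard<std::mutex> guard(mutex_);  // other writers queue here
        entries_.push_back(std::move(entry));
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return entries_.size();
    }

private:
    mutable std::mutex mutex_;
    std::vector<std::string> entries_;
};

// Two "people" (threads) writing to the same board; no write is lost.
inline std::size_t drawTogether(int writesPerPerson) {
    assert(writesPerPerson >= 0);
    DrawingBoard board;
    auto person = [&board, writesPerPerson](int id) {
        for (int i = 0; i < writesPerPerson; ++i) board.write(id, i);
    };
    std::thread first(person, 0);
    std::thread second(person, 1);
    first.join();
    second.join();
    return board.size();
}
```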
But this is acceptable only when it happens occasionally, and many things during a game frame happen not just once per frame but dozens or hundreds of times during each frame. We just can’t lock everything all the time. But we can arrange the work in such a manner that, during a given period of time, all tasks only read common data without modifying it and write their results to separate parts of memory, so that the following tasks can read them and do something else.
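A minimal sketch of the “read common data, write to separate memory” idea, assuming a toy step that only updates velocities (`WorldSnapshot` and `advanceVelocities` are hypothetical). Each thread reads the same shared input but writes only its own slice of the output array, so no lock is needed; joining the threads acts as the boundary before the next step reads the results.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Common input for this step: every task reads it, nobody writes it.
struct WorldSnapshot {
    float gravity = -9.8f;
    float dt = 1.0f / 90.0f;
};

// Each worker reads the shared snapshot and the shared input array, but
// writes only to its own range of the output array, so no locks are needed.
inline std::vector<float> advanceVelocities(const WorldSnapshot& world,
                                            const std::vector<float>& velocities,
                                            unsigned workerCount) {
    assert(workerCount > 0);
    const std::size_t n = velocities.size();
    std::vector<float> next(n);  // separate output memory for this step
    std::vector<std::thread> workers;
    for (unsigned w = 0; w < workerCount; ++w) {
        const std::size_t begin = n * w / workerCount;  // worker w owns [begin, end)
        const std::size_t end = n * (w + 1) / workerCount;
        workers.emplace_back([&world, &velocities, &next, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                next[i] = velocities[i] + world.gravity * world.dt;
        });
    }
    for (auto& t : workers) t.join();  // step boundary: all outputs are ready
    return next;                       // the following step may now read them
}
```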
With this idea in mind, I divided a game frame into lots of small steps. This is kind of obvious, because a game frame is divided into a lot of steps anyway, but I pushed it a bit further. I divided it into steps that have a common input and separate outputs (and don’t need to know what’s in the other tasks’ outputs). For example, there is a step that is responsible just for gathering data about collisions. It checks the world: for each object, it checks whether it collides and how it should move to resolve the collision, but it doesn’t resolve any collision. That happens in the following steps. The next one prepares movement. It takes into account what the AI wants, what the player wants and what the collisions tell us is possible. But it still does not do any movement. Up to this moment, we don’t modify any general state. All we change is the internal state of objects. What’s more, we don’t change their actual velocity, because that might be used by other objects to determine how to behave. We just set up the “next” velocity. When everything is solved, and we know how and where we move, we... just move. This is where we modify things in the world. This is also when some conflicts may occur, but only occasionally: when we move objects from one room to another. That’s where I have locks. They are acceptable because movement between rooms happens rarely (in terms of a frame: at 90 frames per second, with a few dozen NPCs, some of them standing and some moving within a room, changing rooms happens really rarely).
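The collision/movement split can be sketched roughly as follows. This is a heavily simplified, hypothetical model (1D positions, a toy wall at position 0, rooms 10 units wide), not the engine’s real data structures; `MovingObject`, `RoomWorld` and the step functions are illustrative names. Only the rare room change takes a lock.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Heavily simplified, hypothetical object: 1D position, room given by index.
struct MovingObject {
    float position = 0.0f;
    float velocity = 0.0f;      // current velocity: other objects may read it
    float nextVelocity = 0.0f;  // prepared this frame, private to the object
    float wanted = 0.0f;        // what the AI/player wants
    bool blocked = false;       // result of collision gathering
    int room = 0;
};

struct RoomWorld {
    std::vector<MovingObject> objects;
    std::vector<std::vector<std::size_t>> rooms;  // object indices per room
    std::mutex roomChangeLock;                    // only for rare room changes
    float roomSize = 10.0f;
    float dt = 1.0f / 90.0f;
};

// Step 1: gather collisions. Reads shared state, writes only o.blocked.
inline void gatherCollisions(RoomWorld& w, std::size_t i) {
    MovingObject& o = w.objects[i];
    o.blocked = o.position + o.wanted * w.dt < 0.0f;  // toy wall at position 0
}

// Step 2: prepare movement. Still no change to anything others can see.
inline void prepareMovement(RoomWorld& w, std::size_t i) {
    MovingObject& o = w.objects[i];
    o.nextVelocity = o.blocked ? 0.0f : o.wanted;
}

// Step 3: just move. Only the rare room change touches shared data, so a lock is fine.
inline void applyMovement(RoomWorld& w, std::size_t i) {
    MovingObject& o = w.objects[i];
    o.velocity = o.nextVelocity;
    o.position += o.velocity * w.dt;
    const int newRoom = static_cast<int>(o.position / w.roomSize);
    if (newRoom != o.room) {
        std::lock_guard<std::mutex> guard(w.roomChangeLock);
        assert(newRoom >= 0 && static_cast<std::size_t>(newRoom) < w.rooms.size());
        auto& from = w.rooms[o.room];
        from.erase(std::remove(from.begin(), from.end(), i), from.end());
        w.rooms[newRoom].push_back(i);
        o.room = newRoom;
    }
}

// Each loop could be split into parallel jobs; finishing a loop is the
// barrier that separates one step from the next.
inline void advanceFrame(RoomWorld& w) {
    for (std::size_t i = 0; i < w.objects.size(); ++i) gatherCollisions(w, i);
    for (std::size_t i = 0; i < w.objects.size(); ++i) prepareMovement(w, i);
    for (std::size_t i = 0; i < w.objects.size(); ++i) applyMovement(w, i);
}
```

Within each step, a loop body touches only its own object (plus the locked room lists), which is what allows each loop to be split into jobs.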
This should explain the basic approach to how I deal with a game frame. Now I will present a list of all the parts, everything that happens during a game frame. The list is quite big at the moment, but I want to show you how fragmented the frame is. This is only the game frame, without the system reading input, rendering, audio, etc.:
But this is only the gameplay side of a game frame. We’re still left with rendering, sound and things that do not fall into a game frame.
Rendering and sound are handled in a similar way:
You should now notice that I try to pack as much work that doesn’t affect the actual state of the world as possible into the first part of game frame advancement. The game frame advancement jobs run on any free core, as the render/sound threads have the highest priority. Also, the rendering thread is the main thread. There is one selected thread for game frame advancement, but it only creates tasks/jobs and deals with a few other things that I will cover soon.
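One way to express “render and sound first, game frame jobs on whatever is free” is a priority-ordered job queue. The sketch below is an assumption about how such a queue could look, not the game’s scheduler; `JobQueue`, `JobPriority` and the three priority levels are hypothetical. A real worker would sleep on a condition variable instead of getting an empty result.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <optional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical priorities: render and sound jobs always go first,
// game frame jobs fill whatever cores are free.
enum class JobPriority { GameFrame = 0, Sound = 1, Render = 2 };

struct Job {
    JobPriority priority;
    unsigned sequence;  // keeps FIFO order within one priority
    std::function<void()> work;
};

// "a < b" means a is less urgent, so std::priority_queue keeps the most urgent on top.
struct JobOrder {
    bool operator()(const Job& a, const Job& b) const {
        if (a.priority != b.priority) return a.priority < b.priority;
        return a.sequence > b.sequence;  // older job first
    }
};

class JobQueue {
public:
    void push(JobPriority priority, std::function<void()> work) {
        std::lock_guard<std::mutex> guard(mutex_);
        jobs_.push(Job{priority, nextSequence_++, std::move(work)});
    }

    // Called by any worker thread: takes the most important pending job.
    std::optional<Job> pop() {
        std::lock_guard<std::mutex> guard(mutex_);
        if (jobs_.empty()) return std::nullopt;
        Job job = jobs_.top();
        jobs_.pop();
        return job;
    }

private:
    std::mutex mutex_;
    std::priority_queue<Job, std::vector<Job>, JobOrder> jobs_;
    unsigned nextSequence_ = 0;
};
```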
It is important to mention that job management takes some time. Switching between jobs/tasks also takes time. That’s why I batch them. The batch size depends on how many jobs there are: the more data to process, the bigger the batches. This way I keep idle time as low as possible and also benefit from batching. At some point, I was considering storing the advancement time for each step, but at the moment this is not required. Game frame advancement fits comfortably within the frame time, with a huge margin to spare.
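The batch-size rule (“the more data to process, the bigger the batches”) could be implemented with a simple heuristic like the one below. The formula, the minimum of 16 items and the target of 4 batches per worker are assumptions chosen for illustration, not values from the game; `batchSizeFor` and `batchCountFor` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical heuristic: aim for a few batches per worker (so one slow batch
// can be balanced by the others), but never make batches so small that job
// management overhead dominates the actual work.
inline std::size_t batchSizeFor(std::size_t itemCount, std::size_t workerCount,
                                std::size_t minBatch = 16,
                                std::size_t batchesPerWorker = 4) {
    assert(workerCount > 0 && minBatch > 0 && batchesPerWorker > 0);
    const std::size_t targetBatches = workerCount * batchesPerWorker;
    const std::size_t size = (itemCount + targetBatches - 1) / targetBatches;  // ceil
    return std::max(size, minBatch);
}

// Number of jobs needed to cover all items with the chosen batch size.
inline std::size_t batchCountFor(std::size_t itemCount, std::size_t batchSize) {
    assert(batchSize > 0);
    return (itemCount + batchSize - 1) / batchSize;  // ceil
}
```

With 8 workers, 10 items end up in a single batch of up to 16, while 10,000 items are split into 32 batches of 313.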
There is also one more optimization. Some tasks are not advanced every frame. If an object is not visible (more precisely, if the room the object is in has not been visible for some time), it may skip some advancements: collisions, AI (which is latent anyway), animations. This saves a lot.
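A minimal sketch of visibility-based skipping, assuming the skipped time is accumulated and applied in one larger step so objects don’t fall behind. `SkippableObject`, `maybeAdvance`, the 30-frame grace period and the 0.1 s step are all hypothetical.

```cpp
#include <cassert>

// Hypothetical object whose expensive update (animation, AI, collisions) may be
// skipped while its room is out of sight. Skipped time is accumulated and
// applied in one larger step, so the object does not fall behind.
struct SkippableObject {
    float accumulatedTime = 0.0f;  // time not yet applied
    float simulatedTime = 0.0f;    // total time applied so far
    int advanceCount = 0;          // how many times the expensive update ran

    void advance(float dt) {       // stand-in for the expensive update
        ++advanceCount;
        simulatedTime += dt;
    }
};

// framesSinceRoomVisible is an assumed input from the visibility system.
inline void maybeAdvance(SkippableObject& o, float dt, int framesSinceRoomVisible,
                         int graceFrames = 30, float hiddenStep = 0.1f) {
    assert(dt >= 0.0f && hiddenStep > 0.0f);
    o.accumulatedTime += dt;
    const bool recentlyVisible = framesSinceRoomVisible < graceFrames;
    if (recentlyVisible || o.accumulatedTime >= hiddenStep) {
        o.advance(o.accumulatedTime);
        o.accumulatedTime = 0.0f;
    }
}
```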
When I was working on AAA games, we had to resort to such optimizations for animations. The more distant an object was, the more frames were skipped for it. This led to very strange bugs. It turned out that other parts of the code depended on animations advancing every frame. Sometimes it was for particular states, sometimes for particular objects, and we had to make sure that in these cases animations were advanced every frame. To avoid bugs like these, which end up with the optimization being disabled, I decided to use this approach early in the project. The performance gain is significant: the game loop time goes down by about 20-40%.
Oh, there’s one more thing similar to this. AI code uses latent functions. This means that for many frames the AI is just waiting. Then it does a bit of work in one frame and waits some more. Even if you have lots of AI characters, you may end up doing heavy logic-related work for only a very few of them in any given frame.
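Latent AI code can be thought of as a resumable state machine that spends most frames checking a timer. The sketch below (a hypothetical `LatentGuard` with made-up 2 s and 3 s waits) is one way to model that without coroutines; the expensive logic runs only on the frames where a wait has expired.

```cpp
#include <cassert>

// Hypothetical latent behaviour: "look around, wait 2 s, patrol, wait 3 s, repeat".
class LatentGuard {
public:
    // Called every frame. Returns true only on frames where real logic ran.
    bool update(float dt) {
        assert(dt >= 0.0f);
        if (waitLeft_ > 0.0f) {  // latent wait: almost free
            waitLeft_ -= dt;
            return false;
        }
        ++thinkCount_;           // stand-in for the expensive part (queries, decisions)
        if (state_ == State::LookAround) {
            state_ = State::Patrol;  // where to resume after the wait
            waitLeft_ = 2.0f;
        } else {
            state_ = State::LookAround;
            waitLeft_ = 3.0f;
        }
        return true;
    }

    int thinkCount() const { return thinkCount_; }

private:
    enum class State { LookAround, Patrol };
    State state_ = State::LookAround;
    float waitLeft_ = 0.0f;
    int thinkCount_ = 0;
};
```

Over 10 seconds at 90 fps (900 frames), such a character does real work on only a handful of frames.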
Back to the topic. I showed how game frame advancement, rendering and sound are connected to each other. But this is not everything that happens in the game. There are also extra tasks that may span a few frames: nav-mesh building and various asynchronous tasks.
Nav-mesh building is quite obvious. When I create a room or change something in it, its nav-mesh is built. Whether a nav-mesh has just finished building or there is a request for a new nav-mesh building task, we manage all that at the beginning of a game frame, before we create scenes and advance gameplay-related stuff.
Various asynchronous tasks are anything that takes some time and does not have to be done immediately. These might be generating the world, adding details to rooms, spawning NPCs, etc. Most of the work in these tasks does not affect the world. They are completely separate from anything that happens in the game. Most of the time, that is, because sometimes we have to sync with a game frame: we want to read something from the world or put a newly created object into the game. Any asynchronous task may switch to synchronous mode, just for a very brief moment. This moment comes right after nav-related task management and before scenes and gameplay stuff. Most of the time, when an asynchronous task wants to do that, it is the one waiting. The game frame is much more important; it merely allows an asynchronous task to do something.
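The synchronous window can be sketched as a queue of work items that asynchronous tasks hand to the game thread, which runs them all at one fixed point of the frame. This is a minimal model using `std::packaged_task`, with hypothetical names (`SyncWindow`, `request`, `runWindow`, `demoSyncWindow`); in line with the text, the asynchronous task is the one that waits.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <future>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

class SyncWindow {
public:
    // Called from an asynchronous task. The returned future becomes ready
    // once the game thread has run the work inside the window.
    std::future<void> request(std::function<void()> work) {
        std::packaged_task<void()> task(std::move(work));
        std::future<void> done = task.get_future();
        std::lock_guard<std::mutex> guard(mutex_);
        pending_.push_back(std::move(task));
        return done;
    }

    // Called by the game thread once per frame, at the window
    // (after nav-related tasks, before scenes and gameplay).
    void runWindow() {
        std::vector<std::packaged_task<void()>> tasks;
        {
            std::lock_guard<std::mutex> guard(mutex_);
            tasks.swap(pending_);
        }
        for (auto& task : tasks) task();  // nothing else modifies the world now
    }

private:
    std::mutex mutex_;
    std::vector<std::packaged_task<void()>> pending_;
};

// Usage: an asynchronous task prepares data on its own, then hops into the
// window for a brief moment to put the result into the "world".
inline std::vector<int> demoSyncWindow() {
    SyncWindow window;
    std::vector<int> world;  // stand-in for the game world
    std::atomic<bool> finished{false};
    std::thread asyncTask([&] {
        int prepared = 6 * 7;  // work that doesn't touch the world
        window.request([&] { world.push_back(prepared); }).wait();
        finished = true;
    });
    while (!finished) window.runWindow();  // the game keeps running frames
    asyncTask.join();
    return world;
}
```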
There are a few additional mechanisms on top of that. One of them is related to object activation. Because we may have multiple objects, or nested ones, we don’t want to activate them (put them into the world) one after another, but all at once. We queue them and then mark them to be processed. The actual activation is divided into two parts. The first is getting objects ready to be activated: this is an asynchronous job, and it may require creating new objects (attachments, etc.). The second part is a very simple synchronous job: adding the readied objects to the world in a batch.
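A minimal sketch of two-phase activation, assuming objects are shared through `std::shared_ptr` and the “world” is just a vector. `GameObject`, `ActivationQueue`, `prepare` and `activateAll` are hypothetical names; `prepare` stands for the asynchronous part and `activateAll` for the short synchronous batch.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <string>
#include <vector>

struct GameObject {  // hypothetical, minimal
    std::string name;
    bool inWorld = false;
};

using ObjectPtr = std::shared_ptr<GameObject>;

class ActivationQueue {
public:
    // Part 1 (asynchronous job): get an object ready, possibly creating
    // extra objects (attachments) that must enter the world together with it.
    void prepare(const ObjectPtr& object, int attachmentCount) {
        std::vector<ObjectPtr> group{object};
        for (int i = 0; i < attachmentCount; ++i)
            group.push_back(std::make_shared<GameObject>(
                GameObject{object->name + ".attachment" + std::to_string(i)}));
        std::lock_guard<std::mutex> guard(mutex_);  // several async jobs may finish at once
        readied_.insert(readied_.end(), group.begin(), group.end());
    }

    // Part 2 (synchronous, inside the window): add everything readied so far
    // to the world in one batch.
    void activateAll(std::vector<ObjectPtr>& world) {
        std::vector<ObjectPtr> batch;
        {
            std::lock_guard<std::mutex> guard(mutex_);
            batch.swap(readied_);
        }
        for (const ObjectPtr& object : batch) {
            object->inWorld = true;
            world.push_back(object);
        }
    }

private:
    std::mutex mutex_;
    std::vector<ObjectPtr> readied_;
};
```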
Another example of such a system is “delayed object creation”. It helps to create objects in order (because each object may have sub-objects that should be created together). If we wanted to create three different objects, added an asynchronous task for each one AND each of them required more asynchronous tasks to be created (because we want something to be done, but not right now, only after we finish the current thing), all of them would interleave. To avoid that, object creation tasks are put into the delayed object creation queue, and a new object is created only if there are no asynchronous tasks currently running or queued.
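The ordering rule of the delayed object creation queue can be modelled single-threaded, as below. `AsyncTasks` and `DelayedObjectCreation` are hypothetical stand-ins, and the loop in `demoDelayedCreation` plays the role of game frames. Because a new creation starts only when no asynchronous task is queued, each object’s follow-up work finishes before the next object begins.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Stand-in for the asynchronous task system (run one task at a time here).
class AsyncTasks {
public:
    void add(std::function<void()> task) { queue_.push_back(std::move(task)); }
    bool idle() const { return queue_.empty(); }
    void runOne() {
        assert(!queue_.empty());
        std::function<void()> task = std::move(queue_.front());
        queue_.pop_front();
        task();  // may add follow-up tasks
    }
private:
    std::deque<std::function<void()>> queue_;
};

class DelayedObjectCreation {
public:
    void request(std::function<void()> create) { pending_.push_back(std::move(create)); }
    // Called regularly: start the next creation only when nothing else is queued.
    void update(AsyncTasks& tasks) {
        if (!pending_.empty() && tasks.idle()) {
            tasks.add(std::move(pending_.front()));
            pending_.pop_front();
        }
    }
    bool empty() const { return pending_.empty(); }
private:
    std::deque<std::function<void()>> pending_;
};

inline std::vector<std::string> demoDelayedCreation() {
    AsyncTasks tasks;
    DelayedObjectCreation creation;
    std::vector<std::string> log;
    for (const char* name : {"A", "B", "C"}) {
        std::string object = name;
        creation.request([&tasks, &log, object] {
            log.push_back(object + ":start");
            // This object needs more work "later", as another asynchronous task.
            tasks.add([&log, object] { log.push_back(object + ":sub"); });
        });
    }
    while (!creation.empty() || !tasks.idle()) {  // stand-in for game frames
        creation.update(tasks);
        if (!tasks.idle()) tasks.runOne();
    }
    return log;
}
```

If all three creation tasks were queued at once instead, the log would read A:start, B:start, C:start, A:sub, B:sub, C:sub, with the work of the three objects interleaved.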
A very important thing to remember is that during asynchronous tasks we should not try to access anything that exists in the world, or the world itself. The world is constantly changing, and at different times, different things in it are being modified, added, deleted or replaced. That’s why asynchronous tasks run beside the game and hop in during the synchronous window to do something.
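Development-time checks for this rule can be as simple as a thread-local flag that says whether the current thread is inside the synchronous window, plus an `assert` in code that modifies the world. This is a hypothetical sketch (`ExecutionContext`, `SyncWindowScope`, `CheckedWorld`), not the game’s checks; since it relies on `assert`, it disappears in builds compiled with `NDEBUG`.

```cpp
#include <cassert>

enum class ExecutionContext { Async, SyncWindow };

// Each thread remembers which context it is currently running in.
inline ExecutionContext& currentContext() {
    thread_local ExecutionContext context = ExecutionContext::Async;
    return context;
}

inline bool inSyncWindow() { return currentContext() == ExecutionContext::SyncWindow; }

// RAII marker: the game thread opens the window for the duration of a scope.
class SyncWindowScope {
public:
    SyncWindowScope() : previous_(currentContext()) {
        currentContext() = ExecutionContext::SyncWindow;
    }
    ~SyncWindowScope() { currentContext() = previous_; }
    SyncWindowScope(const SyncWindowScope&) = delete;
    SyncWindowScope& operator=(const SyncWindowScope&) = delete;

private:
    ExecutionContext previous_;
};

struct CheckedWorld {
    int objectCount = 0;
    void addObject() {
        // Development-only check: compiled out when NDEBUG is defined.
        assert(inSyncWindow() && "world modified outside the synchronous window");
        ++objectCount;
    }
};
```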
Problems I ran into? Many. I have lots of concurrency checks to make sure I run either in a synchronous task or in an asynchronous task. I have lots of mechanisms to make sure that something is read/written when it is expected to be read/written. All of that only runs in development mode. As I already mentioned, I also have locks (spinlocks and multi-read/single-write locks) that might result in short waiting times. They are there to make it easier to wrap your head around what’s going on and to avoid adding extra mechanisms to queue, process and distribute stuff.
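For reference, here is a minimal C++17 sketch of the two lock kinds mentioned: a test-and-test-and-set spinlock built on `std::atomic<bool>`, and a multi-read/single-write lock using the standard `std::shared_mutex`. `SpinLock`, `SharedValue` and `countWithSpinLock` are hypothetical names; the spinlock yields while waiting, which is one common variant, not necessarily what the game does.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <shared_mutex>
#include <thread>
#include <vector>

// Minimal test-and-test-and-set spinlock, meant for very short critical
// sections where spinning briefly is cheaper than putting the thread to sleep.
class SpinLock {
public:
    void lock() {
        while (locked_.exchange(true, std::memory_order_acquire)) {
            // Wait with plain loads (no writes) until the lock looks free.
            while (locked_.load(std::memory_order_relaxed)) std::this_thread::yield();
        }
    }
    void unlock() { locked_.store(false, std::memory_order_release); }

private:
    std::atomic<bool> locked_{false};
};

// Multi-read/single-write: readers share the lock, a writer takes it exclusively.
class SharedValue {
public:
    int read() const {
        std::shared_lock<std::shared_mutex> guard(mutex_);
        return value_;
    }
    void add(int delta) {
        std::unique_lock<std::shared_mutex> guard(mutex_);
        value_ += delta;
    }

private:
    mutable std::shared_mutex mutex_;
    int value_ = 0;
};

// Several threads bump one counter under the spinlock; no increment is lost.
inline int countWithSpinLock(int threadCount, int perThread) {
    assert(threadCount > 0);
    SpinLock lock;
    int counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < threadCount; ++t)
        workers.emplace_back([&lock, &counter, perThread] {
            for (int i = 0; i < perThread; ++i) {
                std::lock_guard<SpinLock> guard(lock);  // SpinLock is BasicLockable
                ++counter;
            }
        });
    for (auto& worker : workers) worker.join();
    return counter;
}
```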
One of the problems I ran into quite early during development was keeping the frame from being broken into more and more steps. Especially when I noticed that the time required just to get all the required objects and process them was getting bigger and bigger, and at one point, more time was spent administering the tasks than doing them. I solved that by batching jobs, but also by selecting and marking the objects that require something to be done in a particular step. Ideally, you would avoid steps that do very little work, but if that work can’t be put somewhere else, it’s better to waste the extra administration time and keep the code clean.
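“Selecting and marking objects” for a step could look like the per-step lists below: objects register for the steps they need, and a step iterates only its own list instead of every object in the world. `Step`, `StepRegistry` and the three step names are hypothetical; the linear `std::find` keeps the sketch short and would need something faster for large lists.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

enum class Step : std::size_t { Collisions, Animation, Ai, Count };

class StepRegistry {
public:
    // Mark an object as needing work in a step (marking twice is harmless).
    void mark(Step step, std::size_t objectId) {
        auto& list = listFor(step);
        if (std::find(list.begin(), list.end(), objectId) == list.end())
            list.push_back(objectId);
    }
    void unmark(Step step, std::size_t objectId) {
        auto& list = listFor(step);
        list.erase(std::remove(list.begin(), list.end(), objectId), list.end());
    }
    // A step walks only these objects.
    const std::vector<std::size_t>& objectsFor(Step step) const {
        assert(step != Step::Count);
        return lists_[static_cast<std::size_t>(step)];
    }

private:
    std::vector<std::size_t>& listFor(Step step) {
        assert(step != Step::Count);
        return lists_[static_cast<std::size_t>(step)];
    }
    std::array<std::vector<std::size_t>, static_cast<std::size_t>(Step::Count)> lists_;
};
```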
There was one moment when I was devastated and wanted to give up. It was when I added lots of background objects to the game. Two things happened. The level generation time rose to two minutes. From 10 seconds. And the framerate dropped from well over 90 fps to 20 fps. I didn’t know which one was worse. And I didn’t want to get rid of all those background objects. Solutions came quite quickly:
First, I decided to share vistas between windows. Most of the time, you see the same stuff when you look out of a window. Using a different light direction, applied only during rendering, helps too. Hey, smoke and mirrors!
Second, I divided everything into “we need this to have the level running” and “this can wait”. This cut level generation time a lot: it then took just 5 seconds to generate the required content. Everything else (the NPCs, decorations, vistas, etc.) is created when the level starts. Because the station door takes a while to open and the player’s movement speed is limited (how fast can you run?), there is a lot of extra time to create all that stuff. And in the final game, a player may want to buy/sell stuff, craft something, etc. This is when I added more complex asynchronous task management (before, it was just “add asynchronous task”; after that, I had the synchronous window, world jobs (asynchronous tasks) and the “delayed object creation” queue). Right now, the levels are more complex and there are more NPCs, but a shorter level still takes just a few seconds to be created.
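The “needed now” / “can wait” split can be sketched as two queues: the required items are generated before the level starts, the rest a few at a time per frame. This is a simplified, single-threaded assumption (in the game, the deferred content is created by asynchronous tasks); `LevelGenerator`, `Need` and the per-frame budget are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

enum class Need { Required, CanWait };

class LevelGenerator {
public:
    void add(Need need, std::function<void()> item) {
        (need == Need::Required ? required_ : deferred_).push_back(std::move(item));
    }
    // Blocking part, before the level starts: only what the level needs to run.
    void generateRequired() {
        while (!required_.empty()) runFront(required_);
    }
    // After the level has started (e.g. while the station door opens):
    // a limited number of deferred items per frame.
    void generateSome(std::size_t budget) {
        for (std::size_t i = 0; i < budget && !deferred_.empty(); ++i) runFront(deferred_);
    }
    std::size_t deferredLeft() const { return deferred_.size(); }

private:
    static void runFront(std::deque<std::function<void()>>& items) {
        assert(!items.empty());
        std::function<void()> item = std::move(items.front());
        items.pop_front();
        item();
    }
    std::deque<std::function<void()>> required_;
    std::deque<std::function<void()>> deferred_;
};
```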
I still had to deal with the framerate. That’s when I introduced a “static object” marker. If an object does not change during its lifetime, it’s advanced just once and then left as it is. It can be switched back to an active state if there is a need.
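A minimal sketch of the “static object” marker, assuming a flag on each object and a single update loop; `SceneObject` and `advanceAll` are hypothetical. A static object is advanced once to settle its state and then skipped until it is marked active again.

```cpp
#include <cassert>
#include <vector>

struct SceneObject {
    bool isStatic = false;
    bool settled = false;  // static and already advanced once
    int advanceCount = 0;

    void markStatic() { isStatic = true; settled = false; }
    void markActive() { isStatic = false; settled = false; }
};

inline void advanceAll(std::vector<SceneObject>& objects) {
    for (SceneObject& o : objects) {
        if (o.isStatic && o.settled) continue;  // static: nothing to do anymore
        ++o.advanceCount;                       // stand-in for the real update
        if (o.isStatic) o.settled = true;
    }
}
```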
I was back to a short level generation time and a framerate above 90 fps, with lots of additional objects in the scene.
I have heard many times that multithreading can result in the most horrible kinds of bugs. And there are a few kinds:
My approach is to try to prevent such things from happening: clear inputs and outputs that don’t mess with each other, and keeping things separated. In most cases, this should be quite easy to do. The biggest offender here is the gameplay code: dealing damage, etc. You can either queue stuff to be processed in a separate step, or lock. Some of these bugs are easy to reproduce (with dual wielding, shooting from both guns at the same target at once resulted in the damage code running for the same object on two different threads).
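The “queue it and process it in a separate step” option for the dual-wielding case might look like this. It is a hypothetical sketch (`Target`, `DamageQueue`, `dualWieldDemo`): both guns only record damage events, and a later single-threaded step applies them, so the damage code never runs for the same target on two threads. Recording still takes a short lock; per-thread event buffers would be another way to avoid it.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

struct Target {
    int health = 100;
};

struct DamageEvent {
    Target* target;
    int amount;
};

class DamageQueue {
public:
    // Any thread (e.g. each gun's shooting code) only records the hit.
    void push(Target& target, int amount) {
        std::lock_guard<std::mutex> guard(mutex_);
        events_.push_back({&target, amount});
    }
    // Run in a step where nothing else touches health.
    void applyAll() {
        std::vector<DamageEvent> events;
        {
            std::lock_guard<std::mutex> guard(mutex_);
            events.swap(events_);
        }
        for (const DamageEvent& e : events) e.target->health -= e.amount;
    }

private:
    std::mutex mutex_;
    std::vector<DamageEvent> events_;
};

// Two guns fire at the same target from two threads; damage is applied later.
inline int dualWieldDemo(int shotsPerGun, int damagePerShot) {
    Target enemy;
    DamageQueue queue;
    auto gun = [&queue, &enemy, shotsPerGun, damagePerShot] {
        for (int i = 0; i < shotsPerGun; ++i) queue.push(enemy, damagePerShot);
    };
    std::thread left(gun);
    std::thread right(gun);
    left.join();
    right.join();
    queue.applyAll();
    return enemy.health;
}
```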
One thing that is good and bad at the same time is that with multithreading working properly and lots of optimizations, you may end up using just 30% of the CPU. This may make you write some odd code that is not the best or the fastest. After all, why would you care if you still have lots of time to waste? At least make it easy to read. And limit such cases to AI and gameplay. That code goes way beyond what’s happening right now and sometimes requires a bigger picture to understand. But, for example, collision detection code? Single purpose: detect collisions. Do whatever it takes to make it run as fast as possible.
That’s it. This should give you an insight into how I managed to get the game running at 90 fps and avoid common multithreading issues (at least neither I nor anyone who has tried the game has run into them; yes, I had crashes, but they were not related to multithreading). If you’d like to ask me something or share your experience, please do so. Even if you want to tell me that this system is pure nonsense. Well, it works, but I am sure that there might be better approaches, and if not me, others will benefit from learning about a better one.