Optimizing CPU Processing

Submitted by epreisz on Fri, 02/23/2007 - 02:39.

Our modern CPUs have an amazing ability to process both arithmetic and logic. Games, stress our CPU with physics, culling, and AI and other tasks. Sometimes, our game’s biggest limit in increasing performance is the processing of arithmetic and logic.
Looking back at the last reading, we found that memory is the cause of many idle cycles on our CPU. We can do hundreds, even thousands of operations in the amount of time it takes for us to fetch memory from system memory. In this reading, we will assume that we aren’t limited by our ability to fetch data and instructions.
CPU processing optimizations are the most popular discussion when considering optimization; however, the micro operations that grab every last ounce of processing are probably more suitable for religious arguments than they are for speeding up our games. Practices such as writing assembly code, or other micro optimizations, are usually my last resort for optimizing a game. Remember, many games are slow because of sections of code, such as drivers and firmware, which we can’t fix with a compiler. If the code we wrote is the slowest in our game, we say that the game is application processing bound. Sometimes our game is limited by our drivers or other system calls. We will discuss that in a later reading.

When are we Application Processing bound?

In order to figure out if we are processor bound, we must first determine if we are CPU limited or GPU limited. We can do this by measuring GPU utilization using either PIX, nVPerfHud, or gDEBugger. If our GPU utilization is less than 90%, then it is likely that we are CPU bound. When GPU utilization is less than 90%, it suggests that our GPU is spending 10% or more running an idle process. Why is there idle time? Because our CPU isn’t able to generate commands fast enough to satisfy the power of the GPU.
Once we determine that we are CPU bound, we must figure out what resource on the CPU is limiting our game. A program such as Intel’s v-Tune or AMD’s code analyst will help us with this. I know v-Tune better so let me share with you how I would determine that my game is application bound.
After determining my code is CPU bound, I would run a sampling on my game. The sampling would tell me where my application is spending it’s time. First, v-Tune tells me where on the system level my code is running. If the code is spending its time in the drivers, then we are driver bound. Likewise this is true if an operating system call is your slowest code. If Intelppm.sys is your slowest module (a module is a section of executable code) then that may suggest that multi-threading will speed up your game (but not always).
If your .exe is the slowest module, then we proceed farther. I would then investigate my executable. By looking at each function, I can determine where the game is spending it’s time. Once I determine the slowest function, I will need to figure out if the code is slow because it is a slow function that is called once, or if it is a small section of code that is called many times. You can determine this by using the call graph included in v-Tune since it records parent-child relationships. Now that you have tracked down the slowest code, you will need to do a visual inspection of the code.
What are you looking for? At this point you are going to be limited by arithmetic, logic, or memory. If it appears as though there is an opportunity to be limited by memory, you may want to run a cache miss analysis. V-Tune provides you the ability to track L1, and L2 cache misses as well as other information (depending on your chip). If your slowest code also corresponds with a section of code that has a high number of cache misses, you are memory limited. If not, chances are that you are application processing limited.

Smart Design & Flexibility

I have heard the phrase that suggests the fastest triangle you can draw, is the one you don’t draw; this concept is also true for CPU processing. Consider the following example.
If a function is called 50,000 times a second, what can you do to increase the performance? Maybe you could rewrite the function in assembly? I think a better approach is to find a way to call the function 10,000 times. It’s a very rare occasion when you have reached the end of the road of reducing the processing by using a better algorithm.
The above concept is the driving force behind many sub systems within a game engine. Quadtree’s, Octrees, BSP trees, and occlusion culling are processes that rely on the fact that doing less application processing is faster than doing more. This isn’t always true for memory. Doing less memory fetches from system memory is not as fast as doing many, many fetches from L1 cache. These readings are not about culling; however, it is always important to note that good application design and flexibility is key to having a fast game.
So you may not understand why flexibility is so important. Our systems are very complex and sometimes, you may not know which decisions to make until after you run and test the code. If you code is inflexible, you will be stuck with your first design.

Solutions Application Processing Bound

When your game’s performance is bound by the math or logic that you wrote, then your game is application processing bound. There are many different system, application, and micro level optimizations that will increase the performance of applications bound by their processing. Because we have access to the code of games that are application processing bound, we have the opportunity to rewrite the code directly; however, a bad habit of many game developers is to optimize code that we think is slow, not code that we have determined to be slow with a proper detection phase.

System Level Solutions

When games are limited by application processing we have two system level solutions. One solution, is a technique that greatly increases the difficult of coding – that solution is multi-threading. It is important to note that multi-threading is a solution for instruction and logic processing, but not for applications that are memory bound (at least with the current memory architecture). We will cover multi-threading in a later reading. Another solution involves balancing.

Balancing is when we move code from over-utilized resources to under-utilized resources. It is common for the CPU to be over-utilized, especially with OpenGL and DX9. By moving work from the CPU to the GPU can free your CPU for doing other work that is not well suited for the GPU. Since GPUs are scaling performance faster than CPUs, then the code I write today for the GPU will gain better performance in 18 months than if it were running on the CPU. Animations and particle systems are some common code modules that ported from the CPU to the GPU.

Application Level Solutions

There are hundreds of application level solutions but those are outside the scope of these readings. Why? Because, those resources are available all over the internet. Application level solutions that reduce application processing are what game engines are all about. System such quad trees, oct trees, bsp trees, are all examples of application level solutions to games that are limited by application processing.

There are hundreds of application level solutions but those are outside the scope of these readings. Why? Because, those resources are available all over the internet. Application level solutions that reduce application processing are what game engines are all about. System such quad trees, oct trees, bsp trees, are all examples of application level solutions to games that are limited by application processing.

One concept overlooked in game engine design is memoitzation. Memoitzation, which sounds similar to memorization, is when a function or algorithm saves its result in anticipation of calling that function again with the same parameters. Subsequent calls of the functions perform a look-up to see if the answer has already been stored. Memoitzation is essentially a dynamic look-up table. Lookup tables, described in detail in the next section, are solutions that trade memory for performance; however, with modern cache designs, using too much memory, or accessing it in a non-cache friendly way, may cause cache misses. Trading instruction processing for cache misses is a fools errand.

Since our game runs in a loop, there are many opportunities to use memoitzation. There are many aspects of our game that may not change every frame. For example, a vehicle driving on the terrain is not likely to teleport across terrain tiles. Therefore, we should consider first assuming what leaf node of our culling system the vehicle is in. Only rarely will a vehicle move out of one node and into another.

Micro Level Solutions.

Micro level application processing techniques are the most recognized and discussed techniques in all of optimization. Messages boards and mailing lists are full of people arguing over these techniques.

Micro optimizations are usually easier to add to an engine than system or application optimizations. There are some downsides to micro optimizations, and in my opinion, a worthwhile system or application optimization is more valuable than many micro optimizations.

One of the problems with micro optimizations deals with the amount of PC configurations that exist. The micro optimization that increases performance on one configuration may not increase performance on another. I have heard a programmer say, “it doesn’t crash on my machine”. The same is true of some micro optimizations. You may hear someone say, “it runs fast on my machine”.

Micro optimizations are not as preferred as system and application optimizations, but they do have their place. Below are some common strategies to increasing instruction processing with micro optimizations.

Replacing Strings with Integers

One mistake many entry level programmers sometimes do is to use a string when they should be using an integer. Comparing a string is much slower than comparing an integer. When you compare a string, you must compare many values, each character, to determine the difference between two strings. Since a string requires more memory, it is also going to require more work. A string is a great interface for a human, not for a computer. We need strings for debugging and presenting to a user (like we do when we put the player’s name above their head).

Microsoft’s FX Framework provides a useful example of mapping a string to an integer. When referencing a variable in a shader, you can do so by referring to the variable as a string, or as a handle. A handle, is simply a integer number that is mapped to the shader’s contestant address. The most efficient way to set a variable using the Microsoft FX Framework is to first use the string to generate a handle, then reference use that handle for the remaining lifetime of your application. By doing this, you are replacing a string compare with an integer.

This method is also important in network programming. Instead of sending the string through the network every time a client needs it, register the string on the server with an integer. Then, when you send that string to the client, do so, but also send along the global integer id for that string. From that point on, you only need to send the id across the network and resolve that value on the client.

In-lining Functions

Coming soon…

Look-up Tables

Coming soon…

Assembly

Coming soon…

Loop Unrolling

Coming soon…

Performance of Operations

Latency and Throughput

Coming soon…

Memory Alignment

Coming soon…

SSE, SSE2, SSE3, SSE4 Instructions

Coming soon…

Write Combined Buffers

Coming soon…